The pandas `read_csv()` function accepts URLs and zip-compressed files directly.

In [6]:
import pandas as pd
from rich import print as rprint
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url, dtype_backend='pyarrow', engine='pyarrow')
city_mpg = df.city08
highway_mpg = df.highway08
len(dir(city_mpg))

420
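
To try the zip handling without a network connection, here is a minimal sketch that writes a small zipped CSV to a temporary directory and reads it back. The file name `sample.csv.zip` and its two rows are made up for illustration; they are not the vehicles dataset.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, not the real vehicles file
csv_text = "city08,highway08\n19,25\n9,14\n"

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "sample.csv.zip"
    # The archive must contain exactly one file for read_csv to open it
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("sample.csv", csv_text)
    # compression defaults to 'infer', so the .zip extension is enough
    sample = pd.read_csv(zip_path)

print(sample)
```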

In [7]:
(city_mpg + highway_mpg) / 2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: double[pyarrow]

Pandas aligns index entries before operating. If the index labels are not unique, every duplicate on one side is paired with every matching duplicate on the other, so the result can contain more rows than either input.

In [10]:
s1 = pd.Series([10,20,30], index=[1,2,2])
s2 = pd.Series([35,44,53], index=[2,2,4], name='s2')
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64
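
One way to see where the extra rows come from is to check `.index.is_unique` and count the result. The sketch below reuses the two series from the cell above; collapsing duplicates with `groupby(level=0).sum()` is only one possible remedy and assumes that summing duplicate labels is what you want.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4])

# Both indexes repeat the label 2
print(s1.index.is_unique, s2.index.is_unique)

total = s1 + s2
# Label 2 contributes 2 x 2 = 4 rows; labels 1 and 4 contribute one row each
print(len(total))

# Assumption: duplicates should be summed before aligning
s1_unique = s1.groupby(level=0).sum()
s2_unique = s2.groupby(level=0).sum()
unique_total = s1_unique + s2_unique
print(unique_total)
```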

In general, functions and methods have parameters that let you change their behavior. For example, when an operand is missing after index alignment, the default result is `NaN` (or `NA` for types such as `Int64`), but the operator methods have a `fill_value` parameter that substitutes a value for the missing operand:

In [11]:
s1.add(s2)

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

In [12]:
s1.add(s2, fill_value=0)

1    10.0
2    55.0
2    64.0
2    65.0
2    74.0
4    53.0
dtype: float64
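
Note that `fill_value` only helps when one operand is missing. A small sketch with made-up values shows that a position missing on both sides stays missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'y' is missing in both series, 'z' exists only in a
a = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, np.nan], index=["x", "y"])

result = a.add(b, fill_value=0)
print(result)
# x: both present, so the values are added
# y: missing on both sides, so the result is still NaN
# z: missing only in b, so b is treated as 0
```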

Here is an example of chaining to calculate the average of city and highway mileage:

In [14]:
(city_mpg.radd(highway_mpg)).div(2)

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: double[pyarrow]
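
The operator form and the method chain compute the same thing. The sketch below uses three made-up rows to compare them; `radd` is the reflected version of `add`, so `city.radd(highway)` evaluates `highway + city`.

```python
import pandas as pd

# Hypothetical sample values for illustration
city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

with_operators = (city + highway) / 2
with_methods = city.add(highway).div(2)
# radd swaps the operands: this is highway + city
reflected = city.radd(highway).div(2)

print(with_methods)
```

The method form is handy in a chain because each step reads left to right and accepts extra parameters such as `fill_value`.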

In [15]:
city_mpg.mean()

18.369045304297103

In [16]:
city_mpg.quantile()

17.0

In [17]:
city_mpg.quantile(.9)

24.0

In [19]:
city_mpg.quantile([.1, .5, .9])

0.1    13.0
0.5    17.0
0.9    24.0
Name: city08, dtype: double[pyarrow]
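
As the cells above suggest, `.quantile()` defaults to the median (`q=0.5`), returns a scalar for a single quantile, and returns a series indexed by quantile for a list. A minimal sketch on made-up values, relying on the default linear interpolation:

```python
import pandas as pd

# Hypothetical sample values
s = pd.Series([1, 2, 3, 4, 5])

median = s.quantile()              # same as s.quantile(0.5)
deciles = s.quantile([0.1, 0.5, 0.9])

print(median)
print(deciles)
```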

In [20]:
(city_mpg.gt(20).sum())

10272

In [27]:
(city_mpg.gt(20).astype(int).mul(100).mean())

np.float64(24.965973167412017)
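
The two cells above rely on booleans behaving like 0 and 1: summing a boolean series counts the `True` values, and the mean gives the fraction that are `True`. A small sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical mileage values
mpg = pd.Series([10, 25, 30, 15])

over_20 = mpg.gt(20)                 # boolean series
count = over_20.sum()                # number of True values
percent = over_20.mean() * 100       # fraction of True values, as a percentage

# Same percentage written as a chain, mirroring the cell above
percent_chain = mpg.gt(20).astype(int).mul(100).mean()

print(count, percent)
```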

You can use `.agg` to calculate the mean:

In [28]:
city_mpg.agg('mean')

18.369045304297103

In [30]:
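def second_to_last(s):
    # Custom aggregation: return the next-to-last value
    return s.iloc[-2]
# String names such as 'var' and 'max' avoid the FutureWarning that recent
# pandas versions emit when np.var or the builtin max is passed to .agg
city_mpg.agg(['mean', 'var', 'max', second_to_last])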

mean               18.369045
var                62.503036
max               150.000000
second_to_last     18.000000
Name: city08, dtype: float64
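
To see how `.agg` shapes its result, here is a sketch on a small made-up series: a single name returns a scalar, while a list returns a series labeled by aggregation name (a custom function is labeled by its `__name__`).

```python
import pandas as pd

def second_to_last(s):
    # Custom aggregation: return the next-to-last value
    return s.iloc[-2]

# Hypothetical sample values
s = pd.Series([4, 8, 15, 16])

single = s.agg("mean")                              # scalar
summary = s.agg(["mean", "max", second_to_last])    # series of results

print(single)
print(summary)
```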

Use `.astype` to convert a series to a specific type. The maximum city mpg is 150, which exceeds the `int8` range of -128 to 127, so `int8[pyarrow]` raises an error:

In [31]:
city_mpg.astype('int8[pyarrow]')

ArrowInvalid: Integer value 132 not in range: -128 to 127

In [32]:
city_mpg.astype('int16[pyarrow]')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int16[pyarrow]
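
If you would rather not pick the width by hand, `pd.to_numeric` with `downcast='integer'` chooses the smallest signed integer type that holds the data. This sketch uses made-up values and NumPy-backed dtypes rather than the pyarrow types above:

```python
import pandas as pd

# Hypothetical values: one series fits in int8, the other does not
fits_int8 = pd.Series([19, 9, 23])
needs_int16 = pd.Series([19, 9, 150])

small = pd.to_numeric(fits_int8, downcast="integer")
wider = pd.to_numeric(needs_int16, downcast="integer")

print(small.dtype, wider.dtype)
```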

Use `np.iinfo` to inspect the limits of an integer type (`np.finfo` does the same for float types):

In [33]:
import numpy as np
np.iinfo('int64')

iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
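
For comparison, a short sketch that lists the limits of the narrower integer types and uses `np.finfo` for the float types:

```python
import numpy as np

# Integer limits for each signed width
for name in ("int8", "int16", "int32", "int64"):
    info = np.iinfo(name)
    print(name, info.min, info.max)

# Float limits: largest value and machine epsilon
f32 = np.finfo("float32")
f64 = np.finfo("float64")
print(f32.max, f32.eps)
print(f64.max, f64.eps)
```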