# Series
## Data import

In [1]:
import pandas as pd

url = "https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip"

df = pd.read_csv(url, dtype_backend="pyarrow", engine="pyarrow")
city_mpg = df.city08

highway_mpg = df.highway08

In [2]:
# Arithmetic operators work element-wise on Series; here `+` and `/` average city and highway mileage.
print((city_mpg + highway_mpg)/2)

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: double[pyarrow]


In [3]:
# You can use the `quantile` method to get specific quantiles of a Series.
city_mpg.quantile([.1, .5, .9])

0.1    13.0
0.5    17.0
0.9    24.0
Name: city08, dtype: double[pyarrow]

In [4]:
# Percentage of vehicles whose city mileage is above 20 mpg.
city_mpg.gt(20).astype("int64[pyarrow]").mul(100).mean()

24.965973167412017
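The percentage above works because the mean of a 0/1 column is the fraction of ones. A minimal sketch of the same idea, assuming a small made-up series (`sample` is hypothetical, and it uses the default NumPy-backed dtypes rather than pyarrow):

```python
import pandas as pd

# Hypothetical values standing in for city_mpg
sample = pd.Series([15, 18, 21, 25])

above = sample.gt(20)  # boolean mask: True where the value is above 20

# The mean of a boolean Series is the fraction of True values
pct_direct = above.mean() * 100

# Same calculation written the way the cell above does it: cast, scale, average
pct_cast = above.astype("int64").mul(100).mean()

print(pct_direct, pct_cast)
```

Both expressions should agree; the explicit cast mainly matters when the backing dtype does not average booleans directly.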

In [5]:
# Number of vehicles whose city mileage is above 20 mpg.
city_mpg.gt(20).sum()

10272

In [6]:
# `agg` accepts the name of an aggregation as a string.
city_mpg.agg('mean')

18.369045304297103

In [8]:
cyl = df.cylinders
# Count the missing values in the cylinders column.
cyl.isna().sum()

np.int64(206)

In [9]:
make = df.make
missing = cyl.isna()

# Use the boolean mask to select the makes of vehicles with no cylinder value.
make.loc[missing]


7138     Nissan
7139     Toyota
8143     Toyota
8144       Ford
8146       Ford
          ...  
34563     Tesla
34564     Tesla
34565     Tesla
34566     Tesla
34567     Tesla
Name: make, Length: 206, dtype: string[pyarrow]
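Once missing values are located, one common follow-up is to fill them. A minimal sketch of a median fill, assuming a small made-up series (`cyl_sample` is hypothetical, not the real `cylinders` column):

```python
import pandas as pd

# Hypothetical cylinder counts with two missing entries
cyl_sample = pd.Series([4.0, None, 6.0, 8.0, None])

# median() skips missing values by default
median = cyl_sample.median()

# fillna returns a new Series; the original is left unchanged
filled = cyl_sample.fillna(median)

print(filled)
```

Whether a median fill is appropriate depends on why the values are missing, so treat this as an illustration of the mechanics only.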

## Exercises
With a dataset of your choice:

1. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using .apply.
2. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using .case_when.
3. Time the differences between the previous two solutions to see which is faster.
4. Replace the missing values of a numeric series with the median value.
5. Clip the values of a numeric series between the 10th and 90th percentiles.
6. Using a categorical column, replace any value that is not in the top 5 most frequent values with 'Other'.
7. Using a categorical column, replace any value that is not in the top 10 most frequent values with 'Other'.
8. Make a function that takes a categorical series and a number (n) and returns a new series in which any value not in the top n most frequent values is replaced with 'Other'.
9. Using a numeric column, bin it into 10 groups with the same width.
10. Using a numeric column, bin it into 10 groups that each hold roughly the same number of values.
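Exercises 6 through 8 share one idea: find the most frequent values, then replace everything else. One possible sketch, assuming a string-valued series (`top_n_or_other` and `makes` are hypothetical names; a true `category` dtype would first need 'Other' added to its categories):

```python
import pandas as pd

def top_n_or_other(ser, n, other="Other"):
    """Hypothetical helper: keep the n most frequent values, replace the rest."""
    # value_counts sorts by frequency, most frequent first;
    # ties at the cutoff are not resolved in any guaranteed order
    top = ser.value_counts().index[:n]
    # where keeps values for which the condition is True and replaces the others;
    # missing values are not in `top`, so they also become `other`
    return ser.where(ser.isin(top), other)

# Hypothetical data standing in for a column such as df.make
makes = pd.Series(["Ford", "Ford", "Toyota", "Toyota", "Toyota", "Tesla", "Nissan"])
result = top_n_or_other(makes, 2)
print(result)
```

Calling the helper with `n=5` or `n=10` would cover exercises 6 and 7.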

In [13]:
mpg_mean = city_mpg.mean()
# Exercise 2: use >= so that values equal to the mean count as 'high', as the exercise asks.
cond = city_mpg >= mpg_mean

city_mpg.case_when(caselist=[
    (cond, "high"),
    (pd.Series(True, index=city_mpg.index), 'low')
])

0        high
1         low
2        high
3         low
4         low
         ... 
41139    high
41140    high
41141     low
41142     low
41143     low
Name: city08, Length: 41144, dtype: object

In [14]:
# Exercise 1: the same labels, computed one value at a time with .apply.
def mean_comparison(val):
    return 'high' if val >= mpg_mean else 'low'

city_mpg.apply(mean_comparison)

0        high
1         low
2        high
3         low
4         low
         ... 
41139    high
41140    high
41141     low
41142     low
41143     low
Name: city08, Length: 41144, dtype: object

In [15]:
# Exercise 5: clip values to the 10th and 90th percentiles.
city_mpg.clip(lower=city_mpg.quantile(.1), upper=city_mpg.quantile(.9))

0        19
1        13
2        23
3        13
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64[pyarrow]

In [17]:
# Exercise 10: qcut picks bin edges at quantiles so each bin holds a similar number of values.
pd.qcut(city_mpg, 10)

0         (18.0, 20.0]
1        (5.999, 13.0]
2         (21.0, 24.0]
3        (5.999, 13.0]
4         (16.0, 17.0]
             ...      
41139     (18.0, 20.0]
41140     (18.0, 20.0]
41141     (17.0, 18.0]
41142     (17.0, 18.0]
41143     (15.0, 16.0]
Name: city08, Length: 41144, dtype: category
Categories (10, interval[float64, right]): [(5.999, 13.0] < (13.0, 14.0] < (14.0, 15.0] < (15.0, 16.0] ... (18.0, 20.0] < (20.0, 21.0] < (21.0, 24.0] < (24.0, 150.0]]
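The `qcut` call above gives bins with similar counts but very different widths. Exercise 9 asks for the opposite: bins of equal width, which is what `pd.cut` produces when given a number of bins. A minimal sketch contrasting the two on a made-up, skewed series (`mpg_sample` is hypothetical):

```python
import pandas as pd

# Hypothetical skewed values: most are small, a few are large
mpg_sample = pd.Series([10, 12, 15, 18, 20, 25, 30, 40, 60, 100])

# Equal-width bins: the value range is split into 5 intervals of the same width
equal_width = pd.cut(mpg_sample, bins=5)

# Equal-count bins: edges are placed at quantiles
equal_count = pd.qcut(mpg_sample, q=5)

# sort=False keeps the bins in interval order instead of ordering by count
width_counts = equal_width.value_counts(sort=False)
count_counts = equal_count.value_counts(sort=False)
print(width_counts)
print(count_counts)
```

With skewed data, equal-width bins tend to crowd most values into the first few intervals, while quantile bins spread them evenly; which one fits depends on whether the intervals or the group sizes matter more.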