# Chapter 4 - Series Introduction

In [1]:
series = {
    'index': [0,1,2,3],
    'data': [145,142,38,13],
    'name': 'songs'
}

def get(series, idx):
    value_idx = series['index'].index(idx)
    return series['data'][value_idx]

get(series,1)

142

In [2]:
# The double abstraction is used in pandas because it allows for other
# data types in the index

songs = {
    'index': ['Paul','John','George','Ringo'],
    'data': [145,142,38,13],
    'name': 'counts'
}

get(songs,'John')

142

In [3]:
# Creating a series in pandas
import pandas as pd

songs2 = pd.Series([145,142,38,13],
            name='counts')
songs2.index

RangeIndex(start=0, stop=4, step=1)

In [4]:
import numpy as np

songs3 = pd.Series([145,142,38,13],
            name='counts',
            index=['Paul','John','George','Ringo'],
            )

# .count() method returns the number of non-null values in a series
print(songs3.count())

# inspecting the .size property gives you the number of entries
print(songs3.size)

# 'Int64' is a nullable integer datatype in pandas. It can hold missing values
# (pd.NA) without pandas having to convert the int64 data to float.
songs3.astype('Int64')

4
4


Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: Int64

### 4.5 Similar to NumPy

In [5]:
# The Series object behaves similarly to a NumPy array. Both respond to
# index operations.
numpy_ser = np.array([145,142,38,13])
print(songs2[1])
print(numpy_ser[1])

# They both have methods in common:
print(songs2.mean())
print(numpy_ser.mean())

142
142
84.5
84.5


They also both have a notion of a boolean array. A boolean array is a series with the same index as the series you are working with that has boolean values, and it can be used as a mask to filter out items. Normal Python lists do not support such fancy index operations, like sticking a list into an index operation.
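
As a small sketch of the claim above, the snippet below tries the same boolean index on a plain Python list, a NumPy array, and a pandas Series. The names `values` and `flags` are made up for illustration.

```python
import numpy as np
import pandas as pd

values = [145, 142, 38, 13]
flags = [True, True, False, False]

# A plain list rejects a list used as an index and raises TypeError.
try:
    values[flags]
    list_supports_mask = True
except TypeError:
    list_supports_mask = False

# NumPy arrays and pandas Series both accept a boolean mask.
array_result = np.array(values)[np.array(flags)]
series_result = pd.Series(values)[pd.Series(flags)]

print(list_supports_mask)
print(array_result.tolist())
print(series_result.tolist())
```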

In [6]:
# Make a 'mask' using a pandas series (a mask is a boolean array used as a filter)
mask = songs3 > songs3.median() # boolean array
mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

Once we have a mask, we can use that as a filter. We just need to pass the mask into an index operation. If the mask has a 'True' value for a given index, the value is kept. Otherwise, the value is dropped. The mask above represents the locations that have a value higher than the median value of the series.

In [7]:
songs3[mask]

Paul    145
John    142
Name: counts, dtype: int64

In [8]:
# NumPy can also filter by boolean arrays, but lacks the .median method on an array.
# Instead NumPy provides a median function in the NumPy namespace.
# The equivalent version in NumPy:

numpy_ser[numpy_ser > np.median(numpy_ser)]

array([145, 142])

### 4.6 Categorical Data
You can load data as categorical if you know it is limited to only a few values.
##### Benefits:
- Use less memory than strings
- Improve performance
- Can have an ordering
- Can perform operations on categories
- Enforce membership on values
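
The memory benefit can be checked with `.memory_usage(deep=True)`. This is a rough sketch on made-up data (`sizes` is a hypothetical series of repeated labels); the exact byte counts depend on the pandas version and string storage, so only the comparison matters.

```python
import pandas as pd

# Many rows, but only three distinct values: a good fit for a categorical.
sizes = pd.Series(['s', 'm', 'l'] * 10_000)
as_category = sizes.astype('category')

# deep=True also counts the memory held by the string objects themselves.
string_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(category_bytes < string_bytes)
```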

In [9]:
# To create category:
s = pd.Series(['m','l','xs','s','xl'], dtype='category')
s

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']

In [10]:
# by default categories do not have an ordering:
# The cat attribute has various properties
s.cat.ordered

False

In [11]:
# To convert a non-categorical series to an ordered category, we can create a type
# with the 'CategoricalDtype' constructor and the appropriate parameters.
# Then we pass this type into the .astype method:

s2 = pd.Series(['m','l','xs','s','xl'])
size_type = pd.api.types.CategoricalDtype(
    categories=['s','m','l'], ordered=True)
s3 = s2.astype(size_type)
s3

0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']

In [12]:
# The values not listed in the CategoricalDtype were replaced with NaN.
# We can now perform comparisons on the ordered categories:
s3 > 's'

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [13]:
# We can add ordering information to existing categorical data:
# You have to include all categories or pandas will throw a ValueError

s.cat.reorder_categories(['xs','s','m','l','xl'], ordered=True)

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
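
To illustrate the ValueError mentioned in the comment above, here is a minimal sketch that first leaves out two categories and then passes the full list. It rebuilds the same `s` series so that it runs on its own.

```python
import pandas as pd

s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')

# Leaving out 'xs' and 'xl' is rejected because every existing
# category must appear in the new list.
try:
    s.cat.reorder_categories(['s', 'm', 'l'], ordered=True)
    raised = False
except ValueError:
    raised = True

# With all five categories the call succeeds and the result is ordered,
# so order-aware operations such as .max() become available.
ordered = s.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)

print(raised)
print(ordered.max())
```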

In [14]:
# String and datetime series have a str and dt attribute that allow us to perform
# common operations specific to that type. If we convert these types to
# categorical types, we can still use the str or dt attributes on them:
s3.str.upper()

0      M
1      L
2    NaN
3      S
4    NaN
dtype: object

### Series Introduction
pd.Series(data=None, index=None, dtype=None, name=None) <br>
Create a series from data (sequence, dictionary, or scalar) <br>
<br>
s.index <br>
Access index of series <br>
<br>
s.astype(dtype, errors='raise') <br>
Cast a series to dtype. To ignore errors (and return original object) use errors='ignore' <br>
<br>
s[boolean_array] <br>
Return values from s where boolean_array is True <br>
<br>
s.cat.ordered <br>
Determine if a categorical series is ordered <br>
<br>
s.cat.reorder_categories(new_categories, ordered=False) <br>
Reorder the categories of the series (optionally marking them as ordered). new_categories must include all existing categories.


### Exercises
1. Create a series with temperature values for the last seven days. Filter out values below the mean.
2. Create a series with colors. Use categorical type.

In [15]:
temperatures = pd.Series(data=[78,63,66,67,65,61,56], name='temperature')
mean_temp = temperatures > temperatures.mean()
temperatures[mean_temp]

0    78
2    66
3    67
Name: temperature, dtype: int64

In [16]:

colors = pd.Series(data=['Blue','Green','Red','Orange'], 
                    dtype='category', 
                    name='colors')

colors

0      Blue
1     Green
2       Red
3    Orange
Name: colors, dtype: category
Categories (4, object): ['Blue', 'Green', 'Orange', 'Red']

# Chapter 5 - Series Deep Dive

### 5.1 Loading the Data
US Fuel Economy Data

In [17]:
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08


In [18]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [19]:
highway_mpg

0        25
1        14
2        33
3        12
4        23
         ..
41139    26
41140    28
41141    24
41142    24
41143    21
Name: highway08, Length: 41144, dtype: int64

In [20]:
# The built-in dir function in Python lists all the attributes available on an object.

len(dir(city_mpg))

420

#### Series Attributes
- Dunder methods (.__add__, .__iter__, etc.) provide many numeric operations, looping, attribute access, and index access. For the numeric operations, these return a series.
- Corresponding operator methods for many of the numeric operations allow us to tweak the behavior (there is a .add method in addition to .__add__)
- Aggregate methods and properties which reduce or aggregate the values in a series down to a single scalar value. The .mean, .max, and .sum methods and .is_monotonic_increasing property are all examples.
- Conversion methods. Some of these start with .to_ and export the data to other formats.
- Manipulation methods such as .sort_values, .drop_duplicates, that return Series objects with the same index.
- Indexing and accessor methods and attributes such as .loc and .iloc. These return Series or scalars.
- String manipulation methods using .str.
- Date manipulation methods using .dt.
- Plotting methods using .plot.
- Categorical manipulation methods using .cat.
- Transformation methods such as .unstack and .reset_index, .agg, .transform.
- Attributes such as .index and .dtype.
- A bunch of private attributes that we will ignore (around 130 of them).
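
One way to explore these groups is to split the output of `dir` by naming convention. This is only a sketch: the three buckets (`dunder`, `private`, `public`) are my own grouping, and the counts vary between pandas versions, so none are shown here.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
names = dir(s)

# Dunder names start and end with two underscores.
dunder = [n for n in names if n.startswith('__') and n.endswith('__')]
# Remaining names with a leading underscore are private by convention.
private = [n for n in names if n.startswith('_') and n not in dunder]
# Everything else is the public API (methods, properties, accessors).
public = [n for n in names if not n.startswith('_')]

print(len(dunder), len(private), len(public))
print('mean' in public, '__add__' in dunder)
```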

In [21]:
# How many attributes are found on the .str and .dt accessors?
ds = pd.Series(data=['hello','goodbye'])
ds2 = pd.Series(pd.date_range('2000-01-01',periods=3,freq='s'))
print(len(dir(ds.str)))
print(len(dir(ds2.dt)))

99
83


# Chapter 6 - Operators & Dunder Methods

### 6.2 Dunder Methods

In [22]:
2+4

6

In [23]:
# Under the covers
(2).__add__(4)

6

In [24]:
# A Python integer object has an .__add__ method that responds to the + operation.
# Because a Series object has this method, you can use + on it. There is also
# a .__truediv__ method that supports division (.__div__ was its Python 2 name).
#
# One way to calculate the average of the two series:

(city_mpg + highway_mpg) / 2 

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

### 6.3 Index Alignment

When you operate on two series, pandas aligns the indexes before performing the operation. Aligning takes each index entry in the left series and matches it up with EVERY entry with the same index in the right series. Because of this, the indexes should be:
- unique (no duplicates)
- common to both series

Duplicate index entries cause a combinatoric explosion of results, and index entries that appear in only one series produce NaN values.

In [25]:
s1 = pd.Series([10,20,30],index=[1,2,2])
s2 = pd.Series([35,44,53],index=[2,2,4])
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

### 6.4 Broadcasting
__What is a scalar, with examples:__<br>
In physics, a scalar is a quantity that is completely described by its magnitude. Examples of scalars are volume, density, speed, energy, mass, and time. Other quantities, such as force and velocity, have both magnitude and direction and are called vectors. In pandas, a scalar is simply a single value, such as the number 2.<br>

When you perform math operations with a scalar, pandas broadcasts the operation to all values. In the averaging example above, the division by 2 is applied to every value of the summed series. This makes it easy to write mathematical operations. It also makes the code easy to read.<br>

Many math operations are optimized and happen quickly in the CPU. This is called vectorization. CPUs leverage a technology called Single Instruction/Multiple Data (SIMD) to apply math operations to a block of memory.
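
A minimal sketch of broadcasting, using a tiny made-up series (`mpg`) in place of the real fuel economy data:

```python
import pandas as pd

mpg = pd.Series([19, 9, 23], name='city08')

# The scalar is applied to every value; no explicit loop is needed.
doubled = mpg * 2
shifted = mpg + 0.5

print(doubled.tolist())
print(shifted.tolist())
```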



### 6.5 Iteration
There is an .__iter__ method, but it should usually not be used. Iterating runs a Python-level loop instead of vectorized C code, so you lose important performance benefits of pandas.
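
The sketch below computes the same result both ways on a made-up series. It only checks that the results agree; timing it yourself (for example with `%timeit` in a notebook) should show the vectorized form running faster on large data.

```python
import pandas as pd

mpg = pd.Series(range(1000))

# Iterating goes through .__iter__ and handles one value at a time in Python.
looped = pd.Series([value + 1 for value in mpg])

# The vectorized form hands the whole block of values to optimized code.
vectorized = mpg + 1

print(looped.tolist() == vectorized.tolist())
```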

### 6.6 Operator Methods
Why are there methods and operators?<br>
In general, functions and methods have parameters to allow you to parameterize or change the behavior based on the parameters. The dunder methods generally fill in NaN when one of the operands is missing following index alignment. The operator methods have a fill_value parameter that changes this behavior. If one of the operands is missing, it will use the fill_value instead.<br>

If we call the .add method with the default parameters, we will have the same result as the + operator:

In [26]:
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

In [27]:
s1.add(s2)

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

In [28]:
# However, we can use the fill_value parameter to specify that we use zero instead:

s1.add(s2, fill_value=0)

1    10.0
2    55.0
2    64.0
2    65.0
2    74.0
4    53.0
dtype: float64

### 6.7 Chaining
Another stylistic reason to prefer the method to the operator is that it makes chaining manipulations easier. Chaining works because most pandas methods do not mutate data in place but instead return a new object. We will see many examples of this. Chaining makes the code easy to read and understand. We can chain with operators as well, but it requires that we wrap the operation with parentheses.<br>

In [29]:
# Below we calculate the average of city and highway mileage using operators:

(city_mpg + highway_mpg)/2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

In [30]:
# Here is an example of chaining to calculate the average:

(city_mpg
    .add(highway_mpg)
    .div(2)
    )

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64

This is a simple example, but chaining can make your code easier to understand. I like to put each operation on its own line. I read this as <br>
"we are taking the city_mpg series, then we are adding the highway_mpg series to it. Finally we are dividing by two."

# Chapter 7 - Aggregate Methods

Aggregate methods collapse the values of a series down to a scalar. Aggregations are the numbers that your boss wants to be reported.

### 7.1 - Aggregations

Calculate the mean by using an aggregation method, .mean:

In [31]:
city_mpg.mean()

18.369045304297103

There are also a few aggregate properties. These start with .is_.
You don't call them; they evaluate to True or False.

In [32]:
city_mpg.is_monotonic_increasing

False

In [34]:
city_mpg.quantile()

17.0

In [35]:
city_mpg.quantile(.9)

24.0

In [36]:
city_mpg.quantile([.1,.5,.9])

0.1    13.0
0.5    17.0
0.9    24.0
Name: city08, dtype: float64

### 7.2 Count and Mean of an Attribute
Neat trick: if you want the count of values that meet some criteria, you can use the .sum method. <br>
For example, to get the count and % of cars with mileage greater than 20, we can use .gt:<br>

DataFrame.gt(other, axis='columns', level=None)<br>
Get greater than of dataframe and other, element-wise (binary operator gt). Series provides the same .gt method, which is what the cells below use.

In [41]:
city_mpg.gt(20).sum()

10272

In [42]:
# If you want the % of values that meet some criteria you can apply the .mean method
city_mpg.gt(20).mul(100).mean()

24.965973167412017

This trick comes from the fact that Python treats True as 1 and False as 0.<br>
If you sum up a series of boolean values, the result is the count of True values.<br>
If you take the mean of a series of boolean values, the result is the fraction of values that are true.
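
The same trick on a series small enough to check by hand. The values are made up; two of the six are greater than 20.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23, 10, 17, 25])
over_20 = mpg.gt(20)          # boolean series: True where mpg > 20

# Python itself treats True as 1 and False as 0.
python_sum = True + True + False

count = over_20.sum()         # number of True values
fraction = over_20.mean()     # share of True values

print(python_sum, count, fraction)
```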

### 7.3 .agg and Aggregation Strings
Finally, the .agg method takes an aggregation string (or function) and transforms the data depending on how it was called.<br>
It shines in the ability to perform multiple aggregations at once.<br>
You can pass aggregation strings, NumPy reduction functions, Python aggregations, or functions you define yourself.

In [45]:
import numpy as np
def second_to_last(s):
    return s.iloc[-2]

city_mpg.agg(['mean',np.var,max,second_to_last])

mean               18.369045
var                62.503036
max               150.000000
second_to_last     18.000000
Name: city08, dtype: float64
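
A short sketch of how the return type of `.agg` depends on the argument: a single string gives a scalar, a list gives a series indexed by the aggregation names. The data is made up. Using strings such as `'var'` and `'max'` instead of `np.var` and the builtin `max` should give the same results, and I believe it avoids deprecation warnings in recent pandas releases.

```python
import pandas as pd

mpg = pd.Series([19, 9, 23, 10, 17], name='city08')

# One aggregation string returns a single scalar.
single = mpg.agg('mean')

# A list of aggregations returns a series labeled by aggregation name.
several = mpg.agg(['min', 'max', 'mean'])

print(single)
print(several)
```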

# Chapter 8 - Conversion Methods
Sometimes you will need to change the type of data. This may be due to formats that do not include type information, or it may be that you can have better performance by changing types.

### 8.1 - Automatic Conversion
.convert_dtypes > This tries to convert a Series to a type that supports pd.NA.<br>
For city_mpg, it changes the type from int64 to Int64:

In [46]:
city_mpg.convert_dtypes()

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int64
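
The benefit of a type that supports pd.NA is easier to see with a missing value. In this sketch (made-up data) the NaN forces the original series to float64, and `.convert_dtypes` moves it to the nullable Int64 type because every present value is a whole number.

```python
import numpy as np
import pandas as pd

# The NaN forces pandas to store these whole numbers as float64.
raw = pd.Series([19, np.nan, 23])

# convert_dtypes picks a nullable type; the missing entry becomes pd.NA.
converted = raw.convert_dtypes()

print(raw.dtype)
print(converted.dtype)
print(converted.isna().sum())
```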