### Series Overview Attributes and Methods


| _Method_ | _Description_ |
| :------- | :------------ |
|pd.Series(data=None, index=None, dtype=None, name=None, copy=False)| create a series from data (sequence, dictionary, or scalar)|
|s.index | access index of series|
|s.astype(dtype, errors='raise') | cast a series to _dtype_. To ignore errors (and return the original object) use errors='ignore'|
|s\[boolean_array\] | return values from s where _boolean_array_ is _True_|
|s.cat.ordered | determine if a categorical series is ordered|
|s.cat.reorder_categories(new_categories, ordered=False) | reorder the categories of the series (optionally marking them as ordered). _new_categories_ must include all existing categories.|





In [1]:
import pandas as pd
import numpy as np

# Introduction

### 1. The pandas Series
Let's first create a pandas Series from a list. To get the best speed (and to leverage vectorized operations), the values in it should be of the same type, though this is not required.

In [7]:
songs = pd.Series([145, 142, 38, 13], name='counts')
songs

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

It is easy to inspect the index of a series (or a data frame), as it is an attribute of the object:

In [8]:
songs.index

RangeIndex(start=0, stop=4, step=1)

The index can also be string-based, in which case pandas indicates that the datatype for the index is object (not string):

In [9]:
songs2 = pd.Series([145, 142, 38, 13], 
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])
songs2

Paul      145
John      142
George     38
Ringo      13
Name: counts, dtype: int64

In [10]:
songs2.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')
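A string-based index lets us look values up by label as well as by position. The snippet below is a minimal sketch that rebuilds the same `songs2` series; the variable names `by_label`, `by_position`, and `subset` are only illustrative.

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# .loc looks up by index label
by_label = songs2.loc['George']

# .iloc looks up by integer position (George is the third entry)
by_position = songs2.iloc[2]

# A list of labels returns a smaller series
subset = songs2.loc[['Paul', 'Ringo']]
print(by_label, by_position)
print(subset)
```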

The actual data (or values) for a series does not have to be numerical or homogeneous:

In [11]:
class Foo:
    pass

ringo = pd.Series(
    ['Richard', 'Starkey', 13, Foo()],
    name='ringo')
ringo

0                                    Richard
1                                    Starkey
2                                         13
3    <__main__.Foo object at 0x7fe2ec9742b0>
Name: ringo, dtype: object

The object data type is used for a series with string values or values that have mixed types. If you have time data, make sure the type is _datetime64\[ns\]_ instead of _object_ (which means you probably have strings for dates), so you can use the date operations.
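As a small sketch of the date advice above, the snippet below converts a series of date strings with `pd.to_datetime`. The sample dates and the names `raw_dates`, `dates`, and `months` are made up for illustration.

```python
import pandas as pd

# Dates stored as strings: pandas treats these as plain text
raw_dates = pd.Series(['2021-01-15', '2021-02-20', '2021-03-25'],
                      name='played')

# Convert to a datetime type so the .dt operations become available
dates = pd.to_datetime(raw_dates)
months = dates.dt.month

print(raw_dates.dtype)
print(dates.dtype)
print(months.tolist())
```

Before the conversion, `raw_dates.dt` is not available because the values are still strings.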

### 2. NaN value
When pandas determines that a Series holds numeric values but cannot find a number to represent an entry, it will use **NaN**. This value stands for _Not a Number_ and is usually ignored in aggregate operations such as `.count` or `.sum`.

In [14]:
nan_series = pd.Series([2, np.nan],
                      index=['Ono', 'Clapton'])
nan_series

Ono        2.0
Clapton    NaN
dtype: float64

The type of this series is _float64_, not _int64_, because _float64_ supports _NaN_, while _int64_ does not.  
Here is an example of how pandas ignores _NaN_. The .count method disregards _NaN_. It indicates that the count of items in the series is one.

In [15]:
nan_series.count()

1

You can inspect the number of entries (including missing values) with the .size property:

In [16]:
nan_series.size

2
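The difference between `.size` and `.count` gives the number of missing values. The sketch below uses a slightly longer, made-up series (the `'Preston'` entry is an assumption added for illustration) and also shows that aggregations skip _NaN_ while element-wise arithmetic carries it through.

```python
import numpy as np
import pandas as pd

plays = pd.Series([2, np.nan, 5],
                  index=['Ono', 'Clapton', 'Preston'])

# Two ways to count missing entries
n_missing = plays.size - plays.count()
n_missing_alt = plays.isna().sum()

# Aggregations skip NaN by default
total = plays.sum()

# Element-wise arithmetic keeps NaN in place
shifted = plays + 1

print(n_missing, n_missing_alt, total)
print(shifted)
```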

### 3. Optional Integer Support for NaN

As of pandas 0.24, there is optional support for a _nullable integer type_. When you create a series, you can pass in `dtype='Int64'` (note the capitalization):

In [17]:
nan_series2 = pd.Series([2, None],
                       index=['Ono', 'Clapton'],
                       dtype='Int64')
nan_series2

Ono           2
Clapton    <NA>
dtype: Int64

In [18]:
nan_series2.count()

1

You can use the .astype method to convert columns to the nullable integer type.

In [19]:
nan_series.astype('Int64')

Ono           2
Clapton    <NA>
dtype: Int64

You can generally ignore 'Int64' if you clean up missing data first, which is good practice anyway. Also, when you ingest data in pandas, most functions use 'int64' (in lower case) by default.
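To illustrate the cleanup route, here is a minimal sketch: casting a float series that still contains _NaN_ to the plain 'int64' type fails, while filling the gap first succeeds. Filling with 0 is only an assumption for the example; the right fill value depends on your data.

```python
import numpy as np
import pandas as pd

floats = pd.Series([2, np.nan], index=['Ono', 'Clapton'])

# Plain int64 cannot hold NaN, so this cast raises an error
try:
    floats.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# After filling the missing value, the cast works
cleaned = floats.fillna(0).astype('int64')
print(cast_failed)
print(cleaned)
```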

### 4. Similar to Numpy

Both the Series object and the NumPy array respond to index operations:

In [23]:
numpy_ser = np.array([145, 142, 38, 13])
print(songs2.iloc[1])  # positional access; plain songs2[1] on a string index is deprecated in recent pandas
print(numpy_ser[1])

142
142


They both have methods in common:

In [24]:
print(songs2.mean())
print(numpy_ser.mean())

84.5
84.5


They both have a notion of a _boolean array_. A boolean array is a series with the same index as the series you are working with, and has boolean values. It can be used as a mask to filter out items.

In [27]:
mask = songs2 > songs2.median()
mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

In [28]:
songs2[mask]

Paul    145
John    142
Name: counts, dtype: int64

NumPy also supports filtering by boolean arrays, but its arrays lack a _.median_ method, so we use the `np.median` function instead.

In [29]:
numpy_ser[numpy_ser > np.median(numpy_ser)]

array([145, 142])
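Masks can also be combined. The sketch below rebuilds `songs2` and uses `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses because of Python operator precedence. The thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

songs2 = pd.Series([145, 142, 38, 13],
                   name='counts',
                   index=['Paul', 'John', 'George', 'Ringo'])

# Both conditions must hold
mid_range = songs2[(songs2 > 20) & (songs2 < 145)]

# Either condition may hold
extremes = songs2[(songs2 < 20) | (songs2 >= 145)]

# Invert a mask
not_top = songs2[~(songs2 > songs2.median())]

print(mid_range)
print(extremes)
print(not_top)
```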

### 5. Categorical Data

Categorical values have a few benefits:
* Use less memory than strings
* Improve performance
* Can have an ordering
* Can perform operations on categories
* Enforce membership on values

Categories are not limited to strings; we can also convert numbers or datetime values to categorical data.
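The memory benefit can be checked with `.memory_usage(deep=True)`. This is a rough sketch on made-up data (five size labels repeated many times); the exact byte counts depend on the pandas version, so the code only prints them.

```python
import pandas as pd

# 100,000 values drawn from only five distinct strings
sizes = pd.Series(['m', 'l', 'xs', 's', 'xl'] * 20_000)
as_category = sizes.astype('category')

# deep=True measures the strings themselves, not just the pointers
string_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

The savings come from storing each distinct string once and keeping small integer codes for the rows, so they shrink when most values are unique.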

In [30]:
s = pd.Series(['m', 'l', 'xs', 's', 'xl'],
             dtype='category')
s

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']

In [31]:
s.cat.ordered

False

To convert a non-categorical series to an ordered category, we can create a type with the _CategoricalDtype_ constructor and the appropriate parameters. Then, we pass this type into the `.astype` method:

In [34]:
s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'])
size_type = pd.api.types.CategoricalDtype(
    categories=['s', 'm', 'l'], ordered=True)
s3 = s2.astype(size_type)
s3

0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']

We can compare the values of an ordered category:

In [35]:
s3 > 's'

0     True
1     True
2    False
3    False
4    False
dtype: bool
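Ordering also affects sorting and aggregation. The following sketch uses a small made-up series without missing values (the name `sizes` is illustrative) to show that `.sort_values` and `.max` follow the category order rather than alphabetical order.

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(
    categories=['s', 'm', 'l'], ordered=True)
sizes = pd.Series(['m', 'l', 's', 'm'], dtype=size_type)

# Sorting follows s < m < l, not alphabetical order
sorted_sizes = sizes.sort_values()

# min and max are defined for ordered categories
largest = sizes.max()
smallest = sizes.min()

print(sorted_sizes.tolist())
print(smallest, largest)
```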

We can also add ordering information to categorical data. We just need to make sure that we specify all of the members of the category or pandas will throw a _ValueError_:

In [36]:
s.cat.reorder_categories(['xs', 's', 'm', 'l', 'xl'], ordered=True)

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']
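For comparison, here is a sketch of the failing case, where two of the five existing categories are left out of the new list. The names `raised` and `message` are only for illustration, and the exact error text may differ between pandas versions.

```python
import pandas as pd

s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')

# 'l' and 'xl' are missing from the new list, so pandas refuses
try:
    s.cat.reorder_categories(['xs', 's', 'm'], ordered=True)
    raised = False
    message = ''
except ValueError as err:
    raised = True
    message = str(err)

print(raised)
print(message)
```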

String and datetime series have a _str_ and _dt_ attribute that allow us to perform common operations specific to that type. If we convert these types to categorical types, we can still use the _str_ or _dt_ attributes on them:

In [37]:
s3.str.upper()

0      M
1      L
2    NaN
3      S
4    NaN
dtype: object
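The same holds for the _dt_ attribute. A minimal sketch with made-up dates, assuming the categorical series was built from datetime values:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2021-01-15', '2021-01-15', '2022-03-25']))

# Convert the datetime series to a categorical series
date_cat = dates.astype('category')

# The .dt accessor is still available on the categorical version
years = date_cat.dt.year
print(years.tolist())
```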

# Deep Dive
### Loading the data
The `read_csv` function can accept not only URLs but also ZIP files. Because the ZIP file contains only a single file, we can use this function directly. If the ZIP file contains multiple files, we would need to decompress the data to pull out the files we are interested in.
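For the multi-file case, one possible approach is the standard library's `zipfile` module. The sketch below builds a tiny in-memory archive so that it runs without a download; the file names `vehicles.csv` and `notes.csv` and their contents are invented for the example.

```python
import io
import zipfile

import pandas as pd

# Build a small ZIP archive in memory with two CSV files
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('vehicles.csv', 'city08,highway08\n19,25\n9,14\n')
    archive.writestr('notes.csv', 'note\nplaceholder\n')
buffer.seek(0)

# List the members, then hand only the one we want to read_csv
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    with archive.open('vehicles.csv') as handle:
        df_small = pd.read_csv(handle)

print(names)
print(df_small)
```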

Let's look at the columns _city08_ and _highway08_ in vehicles.csv:

In [15]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08
highway_mpg = df.highway08

In [16]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [17]:
highway_mpg

0        25
1        14
2        33
3        12
4        23
         ..
41139    26
41140    28
41141    24
41142    24
41143    21
Name: highway08, Length: 41144, dtype: int64

### Series Attributes

Let's examine how many attributes there are on a series:

In [18]:
len(dir(city_mpg))

419

If you have a series object, you can hit TAB after a period, and it will pop up a list of completions.
Here's a summary of what functionality all of these attributes provide:
* **Dunder methods** (`.__add__`, `.__iter__`, etc.) provide many numeric operations, looping, attribute access, and index access. For numeric operations, these return Series.
* Corresponding operator methods for many of the **numeric operations** allow us to tweak the behavior (there is an `.add` method in addition to `.__add__`)
* **Aggregate methods and properties** which reduce or aggregate the values in a series down to a single scalar value. The `.mean`, `.max`, and `.sum` methods and `.is_monotonic` property (`.is_monotonic_increasing` in pandas 2.0 and later) are all examples.
* **Conversion methods**. Some of these start with `.to_` and export the data to other formats.
* **Manipulation methods** such as `.sort_values` and `.drop_duplicates`, which return a _Series_ object with the same index.
* **Indexing and accessor methods and attributes** such as `.loc` and `.iloc`. These return _Series_ or scalars.
* **String manipulation methods** using `.str`.
* **Date manipulation methods** using `.dt`.
* **Categorical manipulation methods** using `.cat`.
* **Plotting methods** using `.plot`.
* **Transformation methods** such as `.unstack`, `.reset_index`, `.agg`, and `.transform`.
* **Attributes** such as `.dtype` and `.index`.
* A bunch of _private_ attributes that we will ignore (about 130 of them)
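A few of these groups can be seen side by side in a short sketch. The two small series `a` and `b` are made up for illustration; note how the `.add` method accepts a `fill_value` that the `+` operator (which calls `.__add__`) has no way to take.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['x', 'y'])

# Dunder method: a + b calls a.__add__(b); 'z' has no partner in b, so it becomes NaN
plus = a + b

# Operator method: fill_value stands in for the missing partner
filled = a.add(b, fill_value=0)

# Aggregate method: reduces the series to a scalar
total = a.sum()

# Conversion method: exports to a Python list
as_list = a.to_list()

# Private attributes start with an underscore
private = [name for name in dir(a) if name.startswith('_')]

print(plus)
print(filled)
print(total, as_list, len(private))
```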