# The Pandas Series Data Structure
### What is Pandas?
#### Pandas is a library widely used in the data science community. It wraps arrays in labeled data structures and provides a large set of optimized operations on them.

### Would I use Pandas for everything?
#### Nope. Machine learning APIs (see 004_sklearn_pandas_linearRegress_opticsMoorningData) tend to want plain single-dimensional arrays.
#### But I would use Pandas to read, prep, and then marshal data into the structure my machine learning API wants.

### Are there other options?

+ [Dask](https://www.dask.org/):
Provides a Pandas-like API that can handle datasets larger than memory by distributing computations across multiple cores or machines.

+ [Polars](https://pola.rs/):
A newer library with a focus on performance and memory efficiency, making it excellent for large datasets and complex operations.

+ [Xarray](https://docs.xarray.dev/en/stable/): For multi-dimensional labeled arrays, similar to Pandas DataFrames but with additional functionality for handling dimensions.

+ [CuPy](https://cupy.dev/): An open-source array library for GPU-accelerated computing with Python. CuPy uses CUDA Toolkit libraries, including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN, and NCCL, to make full use of the GPU architecture.
According to the CuPy project, most operations perform well on a GPU out of the box, and some run more than 100X faster than with NumPy.

### Set up and install the minimally required libraries

In [None]:
# Import the libraries needed to install additional packages at runtime.
import sys
# subprocess runs operating system commands from the program. The notebook "bang" (!)
# syntax also works, but it does not carry over to a plain Python script, so this is
# the more portable approach.
import subprocess
import importlib.util

# Identify the libraries you'd like to add to this runtime environment.
libraries = ["rich", "rich[jupyter]", "unidecode", "icecream",
             "polars[all]", "dask[complete]", "xarray"]

# Loop through each library and test for existence; if it is not present, install it quietly.
# Note: names with extras, such as "rich[jupyter]", never match an importable module,
# so pip always runs for them (harmless, since pip skips what is already installed).
for library in libraries:
    # Some packages import under a different name than they install under.
    if library == "Pillow":
        spec = importlib.util.find_spec("PIL")
    else:
        spec = importlib.util.find_spec(library)
    if spec is None:
        print("Installing library " + library)
        # Use the running interpreter's pip so the install lands in this environment.
        subprocess.run([sys.executable, "-m", "pip", "install", library, "--quiet"], check=True)
    else:
        print("Library " + library + " already installed.")

Library rich already installed.
Installing library rich[jupyter]
Library unidecode already installed.
Installing library icecream
Installing library polars[all]
Installing library dask[complete]
Library xarray already installed.


In [None]:
import numpy as np
import pandas as pd
import polars as pl
import dask as da
import xarray as xr
from rich import print as rprint
from icecream import ic

## Quick Pro-tips

#### References:

+ [Polars Config](https://docs.pola.rs/api/python/stable/reference/config.html)

+ [Pandas Config](https://pandas.pydata.org/docs/user_guide/options.html)

+ [Dask Config](https://docs.dask.org/en/latest/configuration.html)

+ [Xarray Config](https://docs.xarray.dev/en/stable/generated/xarray.set_options.html)

In [None]:
# Library configuration examples using Pandas

# Show every row of a dataset (could be HUGE, be careful)
pd.set_option('display.max_rows', None)
# or cap the display at 10 rows; this second call overrides the first
pd.set_option('display.max_rows', 10)

# Long runs of decimal places get tiring to read, so limit floats to 4 places
pd.options.display.float_format = '{:,.4f}'.format

# NumPy equivalent
np.set_printoptions(precision=4)
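
The settings above are global and stay in effect for the rest of the session. As a hedged sketch, `pd.option_context` can scope a display option to a single block instead. The Series `demo` and the variable names here are made up for illustration and are not part of this notebook.

```python
import pandas as pd

# Hypothetical data, used only to illustrate scoped display options.
demo = pd.Series(range(100))

before = pd.get_option("display.max_rows")

# Inside the block, at most 6 rows are shown; the option reverts on exit.
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    truncated_text = repr(demo)

after = pd.get_option("display.max_rows")
print(truncated_text)
```

This keeps a one-off "show me less (or more)" request from leaking into later cells.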

## Series

### Pandas Series

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
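
To make the description above concrete, here is a minimal sketch of two of its claims: labels may repeat, and statistical methods skip missing values by default. The Series `readings` and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Labels must be hashable but need not be unique: "a" appears twice.
readings = pd.Series([1.0, 2.0, np.nan, 4.0], index=["a", "a", "b", "c"])

# Selecting a duplicated label returns a Series holding every match.
dup = readings["a"]

# Statistical methods skip NaN by default: (1 + 2 + 4) / 3
mean_skipna = readings.mean()

# Passing skipna=False lets the NaN propagate into the result.
mean_strict = readings.mean(skipna=False)

print(dup)
print(mean_skipna, mean_strict)
```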

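### Polars Series

A Polars Series is a one-dimensional column of values that share a single data type. Unlike a Pandas Series, it carries no index labels.

Reference: https://docs.pola.rs/py-polars/html/reference/series/index.html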

In [None]:
# A Series is a one-dimensional labeled array capable of holding any data type.
# Mixing integers and strings, as here, gives the Series the generic "object" dtype.
series = pd.Series([1,2,3,4,5,'red','green','blue',6,7,8,9])

rprint(series)

# OR
print("--------------------------------------------------------------------------\n\n")

ic(series)

ic| series: 0        1
            1        2
            2        3
            3        4
            4        5
                  ... 
            7     blue
            8        6
            9        7
            10       8
            11       9
            Length: 12, dtype: object


--------------------------------------------------------------------------






In [None]:
# If data is an ndarray, index must be the same length as data. If no index is passed, one will be created.
series = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
ic()
rprint("\n")
rprint(series)
rprint("-------------------------------------------------------------------")
rprint(series.index)
rprint("-------------------------------------------------------------------")
# Use .iloc for positional access; series[0] on a labeled index is deprecated.
rprint(series.iloc[0])
rprint("\n")
ic(series.iloc[0])
rprint("-------------------------------------------------------------------")
rprint(series[:])

ic| <ipython-input-37-b412de0e3e4b>:3 in <cell line: 3>() at 19:22:52.644
ic| series.iloc[0]: 0.5818827337598729


In [None]:
#notice that a series can be created from a classic (key=value pair) dictionary
d = {'b': 1, 'a': 0, 'c': 2}
series=pd.Series(d)
rprint(series)
rprint(series["b"])

In [None]:
# A Series acts very much like an ndarray and is a valid argument to most NumPy functions.
# However, operations such as slicing also slice the index.
# If data is an ndarray, index must be the same length as data. If no index is passed, one will be created.
series = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
rprint("Full array")
rprint("################################################################################################################")
rprint(series)
rprint("################################################################################################################")
rprint("")
rprint("Just the first element")
rprint("    When a single element is accessed, the 'index' label is not included.")
rprint("################################################################################################################")
# Use .iloc for positional access; series[0] on a labeled index is deprecated.
rprint(series.iloc[0])
rprint("")

rprint("The first 3 elements")
rprint("################################################################################################################")
rprint(series.iloc[:3])
rprint("")

rprint("Only those values greater than the median")
rprint("################################################################################################################")
rprint(series[series > series.median()])
rprint("")

rprint("Pass the Series to NumPy and calculate the exponential of each value")
rprint("################################################################################################################")
rprint(np.exp(series))
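
Because a Series supports both positional and label-based access, it can help to make the choice explicit. The following is a small sketch, using a made-up Series `s`, of how `.iloc` (position) and `.loc` (label) differ, especially for slices.

```python
import pandas as pd

# Hypothetical data for illustration only.
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

first_by_position = s.iloc[0]   # element at position 0
first_by_label = s.loc["a"]     # element labeled "a"

# Positional slices exclude the stop position (positions 0, 1, 2)...
by_position = s.iloc[:3]

# ...while label slices include the stop label ("a" through "c").
by_label = s.loc[:"c"]

print(by_position)
print(by_label)
```

Both slices select the same three elements here, but only because the label slice includes its endpoint.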



In [None]:
#Series data type operations
rprint(series.dtype)
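
As a rough illustration of why the dtype matters, the sketch below compares a purely numeric Series with a mixed one, then coerces the mixed one to numbers with `pd.to_numeric`. The Series names are invented for this example.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed = pd.Series([1, 2, "red"])

# A homogeneous integer Series gets a numeric dtype.
print(ints.dtype)

# Mixing integers and strings falls back to the generic object dtype.
print(mixed.dtype)

# errors="coerce" turns values that cannot be parsed ("red") into NaN,
# which in turn makes the result a float Series.
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric)
```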

In [None]:
# Get the underlying array of a Series, for example for direct manipulation
rprint("Dump the contents of the Series into a one-dimensional NumPy array.")
rprint("###############################################################################################")
rprint(series.values)
rprint("")
rprint("My series dimensions are: ", series.ndim)
rprint("My series size is:", series.size)
rprint("My series shape is:", series.shape)
rprint("")
rprint("###############################################################################################")
my_array = series.values
rprint("My array dimensions are: ", my_array.ndim)
rprint("My array size is:", my_array.size)
rprint("My array shape is:", my_array.shape)

rprint("")
rprint("###############################################################################################")
# Idiomatic Python loop over the array's values
for value in my_array:
    rprint(value)

In [None]:
# Now convert the Series to an xarray DataArray
my_xarray=series.to_xarray()
my_xarray

In [None]:
# Dictionary-style access example
rprint("Key 'a' access:", series['a'])
rprint("")
rprint("Example of a bad key request for 'z' with a check:", 'z' in series)
rprint("")
rprint("or")
rprint("")
rprint("Key 'z' access with a .get:", series.get('z'))
rprint("")
rprint("or perhaps more elegant")
rprint("")
rprint("Key 'z' access with a .get and return for failure:", series.get('z', 'Not found'))


In [None]:
#vector manipulations
add_series=series+series
rprint("Series added to itself:\n", add_series)

rprint("###############################################################################################")

multiply_series=series * 2
rprint("")
rprint("Series multiplied by 2:\n", multiply_series)
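
One detail worth knowing about vectorized operations: Pandas aligns the two operands by label, not by position. The sketch below, using an invented Series `s`, shows what happens when the labels only partly overlap.

```python
import pandas as pd

# Hypothetical data for illustration only.
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# s.iloc[1:] has labels b, c, d; s.iloc[:-1] has labels a, b, c.
# The result covers the union of labels; "a" and "d" exist on only
# one side, so they come out as NaN.
aligned = s.iloc[1:] + s.iloc[:-1]

# The .add method with fill_value=0 treats a missing side as 0 instead.
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)

print(aligned)
print(filled)
```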


In [None]:
# The Series name attribute (None until a name is assigned)
rprint("Name your data")
rprint("###############################################################################################")
rprint(series.name)
rprint("or")
series2 = series.rename("My Example Series")
rprint(series2.name)
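
As a short sketch of why naming is useful: `rename` returns a new Series and leaves the original untouched, and the name becomes the column label when the Series is turned into a DataFrame. The Series `s` below is made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
s = pd.Series([1, 2, 3])

# rename returns a new Series; the original keeps its name (None here).
named = s.rename("My Example Series")

# The Series name carries over as the column label of the DataFrame.
frame = named.to_frame()

print(s.name)
print(named.name)
print(frame.columns.tolist())
```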