# worksheet7: Pandas Series

- pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python
- pandas is often used in tandem with numerical computing tools 
like NumPy and SciPy, analytical libraries like statsmodels and 
scikit-learn, and data visualization libraries like matplotlib
- pandas adopts significant parts of NumPy’s idiomatic style of 
array-based computing, especially array-based functions and 
a preference for data processing without for loops.

In [None]:
import numpy as np
import pandas as pd

In [None]:
%%html
<h1>Series and DataFrame</h1>

- Series is a 1D array representing a single data column (i.e., a single column of an Excel spreadsheet)
- A DataFrame is a 2D array used to represent tabular or spreadsheet-like data (i.e., rows and columns)
- A DataFrame can be thought of as a collection of Series (columns)
- Both Series and DataFrame have an index (an associated array of data labels)
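
As a rough illustration of the "collection of Series" idea, the sketch below builds a small DataFrame from two Series that share an index. The gene labels, the `n_exons` column, and all values are hypothetical toy data, not part of this worksheet's dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
lengths = pd.Series([2, 4], index=["geneA", "geneB"], name="length_kb")
exons = pd.Series([5, 9], index=["geneA", "geneB"], name="n_exons")

# Each Series becomes one column; the shared index becomes the row labels
df = pd.DataFrame({"length_kb": lengths, "n_exons": exons})

# Selecting a single column gives back a Series
col = df["length_kb"]
print(type(col).__name__)  # Series
```
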

### Create a simple pandas series with 1D data
- using `pd.Series` on a list or np.array

In [None]:
gene_lengths_kb = [2, 4, 4, 5]

In [None]:
gene_lengths_kb_ser = pd.Series(gene_lengths_kb)

In [None]:
gene_lengths_kb_ser

In [None]:
gene_lengths_kb_arr = np.array(gene_lengths_kb)

In [None]:
gene_lengths_kb_ser_arr = pd.Series(gene_lengths_kb_arr)

In [None]:
gene_lengths_kb_ser_arr

### Components of a series 
- the index is the leftmost column of a series; it holds the data labels
- the actual data are called the values of the series (the right column)
- a default index with `range(N)` is created when you do not specify an index, where `N` is the length of the data

In [None]:
gene_lengths_kb_ser

### series.index

In [None]:
gene_lengths_kb_ser.index

### series.values

In [None]:
gene_lengths_kb_ser.values

In [None]:
type(gene_lengths_kb_ser.values)

### Often desirable to specify your own index data labels
- called a `named series`, with prespecified index labels
- allows you to index series by label

In [None]:
gene_lengths_kb = [2, 4, 4, 5]

In [None]:
gene_names = ['BRCA1','BRCA2', 'SMAD2', 'TTN']

In [None]:
gene_lengths_kb_ser = pd.Series(
    gene_lengths_kb, 
    index=gene_names
)

In [None]:
gene_lengths_kb_ser

### index elements of a series by data labels

In [None]:
gene_lengths_kb_ser['BRCA1']

In [None]:
gene_lengths_kb_ser['SMAD2']

### index elements of a series by position

In [None]:
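# BRCA1 gene length
# note: recent pandas (2.x) warns that integer keys in [] on a labeled index are deprecated; .iloc[0] is the explicit form
gene_lengths_kb_ser[0]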

In [None]:
# SMAD2 gene length
gene_lengths_kb_ser[2]

### index elements of a series by a list of data labels

In [None]:
gene_names[:2]

In [None]:
gene_lengths_kb_ser[gene_names[:2]]

## numpy functions applied to a series

### Q: apply a numpy boolean mask to a series
- can you filter `gene_lengths_kb_ser` to only include genes with lengths >2kb?
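
One possible approach, shown as a minimal self-contained sketch that rebuilds the series from the worksheet's values: comparing a series with a scalar yields a boolean series with the same index, which can then be used as a mask. The names `mask` and `long_genes` are introduced here for illustration.

```python
import pandas as pd

gene_lengths_kb_ser = pd.Series(
    [2, 4, 4, 5],
    index=["BRCA1", "BRCA2", "SMAD2", "TTN"],
)

# Element-wise comparison returns a boolean Series aligned to the same labels
mask = gene_lengths_kb_ser > 2

# Indexing with the mask keeps only the rows where the mask is True
long_genes = gene_lengths_kb_ser[mask]
print(long_genes)
```
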


### Q: apply numpy broadcasting
- add 1 kb to all gene lengths in the series
- multiply all gene lengths by 2kb
- exponentiate gene lengths
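
A hedged sketch of one way to answer the three bullets above. "Exponentiate" is read here as either `np.exp` (e raised to each value) or raising each value to a power; which one is intended is an assumption. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

gene_lengths_kb_ser = pd.Series(
    [2, 4, 4, 5],
    index=["BRCA1", "BRCA2", "SMAD2", "TTN"],
)

# A scalar is broadcast across every element; index labels are preserved
plus_one = gene_lengths_kb_ser + 1   # 3, 5, 5, 6
doubled = gene_lengths_kb_ser * 2    # 4, 8, 8, 10

# Two readings of "exponentiate"
exponentiated = np.exp(gene_lengths_kb_ser)  # e ** length, returned as floats
squared = gene_lengths_kb_ser ** 2           # length ** 2
```
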

### Series can also be thought of as 
- a mapping of index labels to data values
- similar to an ordered dictionary

In [None]:
'BRCA1' in gene_lengths_kb_ser

In [None]:
'KRAS' in gene_lengths_kb_ser

In [None]:
'KRAS' not in gene_lengths_kb_ser

### Ways to create a series
- using `pd.Series` on a list or np.array
- using `pd.Series` on a dict

Let us build a dict from `gene_names` and `gene_lengths_kb` using a dict comprehension over `zip`, and then create a series from this dict

In [None]:
gene_names

In [None]:
gene_lengths_kb

In [None]:
gene_lengths_kb_dict = {
    k : v
    for k,v in zip(gene_names, gene_lengths_kb)
}

In [None]:
gene_lengths_kb_dict

In [None]:
gene_lengths_kb_ser_from_dict = pd.Series(gene_lengths_kb_dict)

In [None]:
# index labels follow the dict's insertion order in current pandas (they are not re-sorted)
gene_lengths_kb_ser_from_dict

### Q: can you change the order in which these index labels appear in the series?
- I'd like in this order: 'TTN', 'BRCA2', 'SMAD2', 'BRCA1'

In [None]:
new_order = ['TTN', 'BRCA2', 'SMAD2', 'BRCA1']
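
Two possible ways to get this ordering, shown as a self-contained sketch that recreates the dict from the worksheet's values: pass `index=` to the constructor, or call `reindex` on an existing series. The names `from_constructor` and `from_reindex` are introduced here for illustration.

```python
import pandas as pd

gene_lengths_kb_dict = {"BRCA1": 2, "BRCA2": 4, "SMAD2": 4, "TTN": 5}
new_order = ["TTN", "BRCA2", "SMAD2", "BRCA1"]

# Option 1: let the constructor pick dict values in the requested label order
from_constructor = pd.Series(gene_lengths_kb_dict, index=new_order)

# Option 2: build the series first, then reorder it by label
from_reindex = pd.Series(gene_lengths_kb_dict).reindex(new_order)

print(from_reindex)
```
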


### Q: can you exclude `SMAD2` from the series?

In [None]:
new_order = ['TTN', 'BRCA2', 'BRCA1']
pd.Series(gene_lengths_kb_dict, index=new_order)

### add a new gene label `HRAS` to the above series
- just the label, no data

In [None]:
new_order = ['TTN', 'BRCA2', 'BRCA1', 'HRAS']
hras_ser = pd.Series(gene_lengths_kb_dict, index=new_order)

In [None]:
hras_ser

### Missing values check
- pd.isnull()
- pd.notnull()
- ser.isnull()
- ser.notnull()
- ser.isna()
- ser.notna()
- pd.isna()
- pd.notna()

### Q: what should `pd.isnull(hras_ser)` output?

In [None]:
pd.isnull(hras_ser)

### Q: what should `pd.notnull(hras_ser)` output?

In [None]:
pd.notnull(hras_ser)

### a series has methods matching the top-level `pd` functions

In [None]:
hras_ser.isnull()

In [None]:
hras_ser.notnull()

In [None]:
hras_ser.isna()

In [None]:
hras_ser.notna()

### arithmetic operations
- series addition
- automatically align index labels

In [None]:
gene_lengths_one_kb = [1] * 4

In [None]:
gene_lengths_one_kb

In [None]:
gene_names

In [None]:
gene_lengths_one_ser = pd.Series(
    gene_lengths_one_kb,
    index=gene_names
)

In [None]:
gene_lengths_one_ser

In [None]:
gene_lengths_kb_ser

In [None]:
gene_lengths_kb_ser + gene_lengths_one_ser

### Q: guess the output of 
- gene_lengths_kb_ser + hras_ser
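
To check a guess, the sketch below rebuilds both series from the worksheet's values and adds them. Addition aligns on index labels, so any label that is missing from one operand, or whose value is already missing, ends up as `NaN` in the result. The name `total` is introduced here for illustration.

```python
import pandas as pd

gene_lengths_kb_dict = {"BRCA1": 2, "BRCA2": 4, "SMAD2": 4, "TTN": 5}
gene_lengths_kb_ser = pd.Series(gene_lengths_kb_dict)
hras_ser = pd.Series(gene_lengths_kb_dict, index=["TTN", "BRCA2", "BRCA1", "HRAS"])

# Values are matched by label, not by position
total = gene_lengths_kb_ser + hras_ser

print(total["BRCA1"])          # 4.0 (2 + 2)
print(total["TTN"])            # 10.0 (5 + 5)
print(total.isna().to_dict())  # HRAS and SMAD2 are flagged as missing
```

The result is float typed because `NaN` cannot be stored in an integer series.
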

### Other series manipulations
- change series index
- specify a `dtype`
- name a series
- get the `size` of a series

#### change series index

In [None]:
hras_ser

In [None]:
new_hras_ser_index = [idx + '_gene' for idx in hras_ser.index]

In [None]:
hras_ser.index = new_hras_ser_index

In [None]:
hras_ser

#### specify a dtype

In [None]:
gene_lengths_kb_ser

In [None]:
gene_lengths_kb_ser_different_dtype = pd.Series(
    gene_lengths_kb,
    index=gene_names,
    dtype='float'
)

In [None]:
gene_lengths_kb_ser_different_dtype

#### name a series

In [None]:
gene_lengths_kb_ser_different_dtype = pd.Series(
    gene_lengths_kb,
    index=gene_names,
    dtype='float',
    name='gene_lengths_kb_ser_different_dtype'
)

In [None]:
gene_lengths_kb_ser_different_dtype

In [None]:
gene_lengths_kb_ser_different_dtype.name

#### size of a series

In [None]:
gene_lengths_kb_ser_different_dtype.size

#### Q: can you reshape a series into higher dimensions?
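
A short sketch of one way to explore this question: a series is always one dimensional, so reshaping is done on the underlying NumPy array, and the index labels are not carried along. The name `arr_2d` is introduced here for illustration.

```python
import pandas as pd

gene_lengths_kb_ser = pd.Series(
    [2.0, 4.0, 4.0, 5.0],
    index=["BRCA1", "BRCA2", "SMAD2", "TTN"],
)

print(gene_lengths_kb_ser.ndim)  # 1

# Convert to a NumPy array first, then reshape; the labels are dropped
arr_2d = gene_lengths_kb_ser.to_numpy().reshape(2, 2)
print(arr_2d.shape)  # (2, 2)
```
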

### Slicing a series
- very similar to slicing a numpy 1D array
- allows you to slice by label or integer positions

In [None]:
gene_lengths_kb_ser

In [None]:
gene_lengths_kb_ser[-2:]

In [None]:
gene_lengths_kb_ser[2:5:1]

In [None]:
gene_lengths_kb_ser['BRCA1':'TTN']

In [None]:
gene_lengths_kb_ser[:4]

In [None]:
gene_lengths_kb_ser['BRCA1']

In [None]:
gene_lengths_kb_ser[0]

### Selection with `loc` and `iloc`
- enable you to select a subset of rows and columns
- label based (`.loc`)
- integer position based (`.iloc`)
- more when we discuss DataFrames

### Need for `.loc` and `.iloc`
- confusing to index a series based on both integers and index labels
- if index labels are integers, data selection will always be by labels
- this makes selection by integers unreliable
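
The ambiguity is easiest to see with integer labels that do not start at zero. In the sketch below (a hypothetical series with labels 10, 20, 30), plain `[]` treats the integer as a label, so `ser[0]` fails even though a first element exists, while `.loc` and `.iloc` state the intent explicitly.

```python
import pandas as pd

# Hypothetical series whose index labels are integers, but not 0..N-1
ser = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = ser[10]  # 'a': the integer is interpreted as a label

try:
    ser[0]  # there is no label 0, so this raises KeyError
    positional_lookup_worked = True
except KeyError:
    positional_lookup_worked = False

by_loc = ser.loc[10]   # 'a': explicit label lookup
by_iloc = ser.iloc[0]  # 'a': explicit position lookup
```
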

In [None]:
integer_series = pd.Series(
    np.arange(5)
)

In [None]:
integer_series

In [None]:
integer_series[1]

### Q: will this work?
- `integer_series[-1]`

### what about now?

In [None]:
integer_series = pd.Series(
    np.arange(5),
    index = ['a', 'b', 'c', 'd', 'e']
)

In [None]:
integer_series[-1]

#### `series.loc`
- indexing series explicitly using only the index label
- indexing by position won't work
- can work with individual index labels or slices of labels

In [None]:
# can specifically only use label with `.loc`
gene_lengths_kb_ser.loc['BRCA2']

### will this work?
- gene_lengths_kb_ser.loc[1]

In [None]:
gene_lengths_kb_ser.loc['BRCA2':'KRAS']

In [None]:
gene_lengths_kb_ser.loc[['BRCA2', 'BRCA1']]

#### add a new gene to the series using `.loc`

In [None]:
gene_lengths_kb_ser.loc['TMPRSS2'] = 10

In [None]:
gene_lengths_kb_ser

#### `series.iloc` 
- indexing series explicitly by integer location
- indexing by label won't work
- can work with individual integer locations or slices of integer locations

In [None]:
gene_lengths_kb_ser.iloc[1]

### will this work?
- gene_lengths_kb_ser.iloc['BRCA2']
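
For this question and the matching `.loc[1]` question above, a small sketch can report what each lookup does. The helper `try_lookup` is hypothetical and exists only for this illustration; the series is rebuilt from the worksheet's original four genes.

```python
import pandas as pd

gene_lengths_kb_ser = pd.Series(
    [2, 4, 4, 5],
    index=["BRCA1", "BRCA2", "SMAD2", "TTN"],
)

def try_lookup(lookup):
    """Run a lookup; return its value, or the exception name if it fails."""
    try:
        return lookup()
    except (KeyError, TypeError) as err:
        return type(err).__name__

# Mismatched key types fail
loc_with_position = try_lookup(lambda: gene_lengths_kb_ser.loc[1])       # 'KeyError'
iloc_with_label = try_lookup(lambda: gene_lengths_kb_ser.iloc["BRCA2"])  # 'TypeError'

# Matching key types succeed
loc_with_label = try_lookup(lambda: gene_lengths_kb_ser.loc["BRCA2"])    # 4
iloc_with_position = try_lookup(lambda: gene_lengths_kb_ser.iloc[1])     # 4
```
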

In [None]:
gene_lengths_kb_ser.iloc[:4]

In [None]:
integer_series.iloc[-1]

## Methods on series

#### simple aggregation

In [None]:
gene_lengths_kb_ser.mean()

#### .agg method:  powerhouse

In [None]:
result = gene_lengths_kb_ser.agg([
    'mean', 'max', 'var',
    'prod', 'min', 'median',
    'all', 'any',
    'std', 'sum', 'nunique',
    'sem', 'size',
])

In [None]:
result

In [None]:
type(result)

### .value_counts()

In [None]:
gene_lengths_kb_ser.value_counts()

### apply function

In [None]:
# with a numpy function
gene_lengths_kb_ser.apply(np.log)
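
`apply` also accepts an ordinary Python function, which is called once per value. The sketch below uses a hypothetical helper, `label_length`, with an arbitrary 3 kb cutoff chosen only for illustration. It also shows that for NumPy functions, calling the function directly on the series gives the same result as `apply`.

```python
import numpy as np
import pandas as pd

gene_lengths_kb_ser = pd.Series(
    [2, 4, 4, 5],
    index=["BRCA1", "BRCA2", "SMAD2", "TTN"],
)

def label_length(length_kb):
    """Hypothetical helper: label a gene as 'long' or 'short' (3 kb cutoff is arbitrary)."""
    return "long" if length_kb > 3 else "short"

# The custom function runs element by element; index labels are kept
labels = gene_lengths_kb_ser.apply(label_length)

# NumPy functions can be applied either way
log_apply = gene_lengths_kb_ser.apply(np.log)
log_direct = np.log(gene_lengths_kb_ser)
```
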

# Collaborative exercises

## Exercise 1

- Decide on a threshold voltage for hyperpolarization (or depolarization) with your team
- Write a custom function that accepts a voltage and returns a `hyperpolarizing` or `depolarizing` label as output for valid voltages. Raise appropriate exceptions for invalid voltages, including a `missing input` or `invalid input` label
- Create a randomly simulated numpy array of voltages (n=10). Select two random positions in this array without replacement using `np.random.choice` and set the values to `np.nan` using numpy integer masks
- Convert numpy array to series
- Apply your custom function to the series and show the output
- Can you also check for missing values and print which rows of the series have those missing values?
- Can you create a new series where you drop the rows with missing values using `series.dropna()`?
- Can you create a new series where you fill the missing values with `0` using `series.fillna(0)`?
- https://numpy.org/doc/2.1/reference/random/generated/numpy.random.choice.html
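
As a starting point for the simulation step only, here is a minimal sketch. The voltage range (uniform between -90 and 40, read as millivolts) is an assumption made for illustration; pick whatever distribution suits your team's threshold.

```python
import numpy as np
import pandas as pd

# Assumption: voltages in millivolts, drawn uniformly between -90 and 40
voltages = np.random.uniform(-90, 40, size=10)

# Choose two distinct positions (no replacement) and blank them out
nan_positions = np.random.choice(voltages.size, size=2, replace=False)
voltages[nan_positions] = np.nan

# Convert the array to a series with the default integer index
voltage_ser = pd.Series(voltages)
```
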

## Exercise 2

- Create two toy datasets of EEG potentials recorded in microvolts for four different electrodes, with one potential per electrode. Introduce at least one missing value in each dataset.

- An example dataset could look like the following. Note that you will create two such datasets, with the same `electrode_name` in each, but different voltages

```
electrode_name, micro_volts
electrode1, 5
electrode2, 2
electrode3,  np.nan
electrode4, 10
```
- Load each dataset into a pandas series
- First, calculate the total voltage for each electrode using series addition. Explain output
- Second, add data for an extra electrode in the second dataset, but not the first. Redo the total voltage calculation. What do you observe?
- Third, create a new pandas series for the second dataset, this time with no index labels. Perform series addition again. Explain your observations

## Exercise 3

- Randomly simulate 100 different EEG potentials with data labels and load into a pandas `series_A`
- Randomly select 50 rows of `series_A` using numpy and save in a new series called `ser_A_dup_data`. Modify the index labels of `ser_A_dup_data`
- Create a new `series_B` combining `series_A` with `ser_A_dup_data` (you just created a new series with duplicates). Use `pd.concat()` for this
- Calculate the 10%, 50% and 90% quantiles of `series_B` using `series.quantile(q)`, where `q` is 0.1, 0.5 or 0.9 (or the list `[0.1, 0.5, 0.9]`)
- Experiment with the `sort_index()`, `sort_values()` and `drop_duplicates()` methods on `series_B`. Run `drop_duplicates()` with no arguments, `keep='first'`, `keep='last'` and `keep=False`. What do you observe?
- Experiment with `series_B.reset_index()` and `series_B.reset_index(drop=True)`. What do you observe? Explain your findings