# Data Science: Data processing

Typical packages: `pandas`, `plotnine`, `plotly`, `streamlit`

**References**

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
- [Python for Data Analysis, 2nd Edition](https://github.com/wesm/pydata-book)

## Introduction to `pandas`

In [None]:
import numpy as np
import pandas as pd

## Series and Data Frames

### Series objects

A `Series` is like a vector: all elements must have the same type or be null.

In [None]:
s = pd.Series([1,1,2,3] + [None])
s
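As a small sketch of what "same type or null" means in practice: the `None` in the list above is stored as `NaN`, which forces the integers to become floats. The nullable `Int64` dtype shown second is an alternative that is not used elsewhere in this notebook.

```python
import pandas as pd

# None is converted to NaN, so the integers are upcast to float64.
s = pd.Series([1, 1, 2, 3] + [None])
print(s.dtype)         # float64
print(s.isna().sum())  # 1

# The nullable "Int64" extension dtype keeps integers next to a missing value.
s_int = pd.Series([1, 1, 2, 3] + [None], dtype="Int64")
print(s_int.dtype)     # Int64
```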

### Size

In [None]:
s.size

### Unique Counts

In [None]:
s.value_counts()
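Note that `size` counts every element while `value_counts` skips nulls by default. A minimal sketch of the difference, using the same values as `s`:

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3] + [None])

print(s.size)     # 5: every element, including the null
print(s.count())  # 4: non-null elements only

counts = s.value_counts()                     # nulls are excluded by default
counts_with_na = s.value_counts(dropna=False) # nulls get their own row
print(counts_with_na)
```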

### Special types of series

#### Strings

In [None]:
words = 'the quick brown fox jumps over the lazy dog'.split()
s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])
s1

In [None]:
s1.str.upper()

In [None]:
s1.str.split()

In [None]:
s1.str.split().str[1]

### Categories

In [None]:
s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])
s2

In [None]:
s2 = s2.astype('category')
s2

In [None]:
s2.cat.categories

In [None]:
s2.cat.codes
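The codes are integer positions into the list of categories, so the original labels can be recovered from the two together. A short sketch with the same values as `s2` (the name `decoded` is only for illustration):

```python
import pandas as pd

s2 = pd.Series(
    ['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic']
).astype('category')

# Categories inferred from strings are stored in sorted order.
print(list(s2.cat.categories))  # ['Asian', 'Black', 'Hispanic', 'White']
print(list(s2.cat.codes))       # [0, 0, 3, 1, 3, 2]

# Looking up each code in the categories gives back the original labels.
decoded = s2.cat.categories[s2.cat.codes.to_numpy()]
print(list(decoded))
```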

### Dates and times

Datetimes are often useful as indices to a time series.

In [None]:
import pendulum

In [None]:
d = pendulum.today()

In [None]:
d.to_date_string()

In [None]:
k = 18
s3 = pd.Series(range(k), 
               index=pd.date_range(d.to_date_string(),
                                   periods=k, 
                                   freq='M'))

In [None]:
s3

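The two selections below assume the index covers 2021. Because the range starts at today's date, substitute a year and months that fall inside your own index; a year outside it raises a `KeyError`.

In [None]:
s3['2021']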

In [None]:
s3['2021-01':'2021-06']

If the datetimes are the values of a series rather than its index, use the `dt` accessor to reach the datetime methods.

In [None]:
s4 = s3.index.to_series()

In [None]:
s4.dt.day_name()
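A sketch with fixed dates, so the output does not depend on when it is run. The names `ts` and `dates` are illustrative only.

```python
import pandas as pd

idx = pd.date_range('2021-01-01', periods=3, freq='D')
ts = pd.Series(range(3), index=idx)

# On a DatetimeIndex the datetime methods are available directly.
print(list(ts.index.day_name()))  # ['Friday', 'Saturday', 'Sunday']

# On a Series of datetimes the same methods sit behind the .dt accessor.
dates = ts.index.to_series()
print(list(dates.dt.day_name()))  # ['Friday', 'Saturday', 'Sunday']
```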

### DataFrame objects

A `DataFrame` is like a matrix. Columns in a `DataFrame` are `Series`.

- Each column in a DataFrame represents a **variable**
- Each row in a DataFrame represents an **observation**
- Each cell in a DataFrame represents a **value**

In [None]:
df = pd.DataFrame(dict(num=[1,2,3] + [None]))
df

In [None]:
df.num

### Index

Row and column identifiers are of `Index` type.

Somewhat confusingly, index is also a synonym for the row identifiers.

In [None]:
df.index

#### Setting a column as the row index

In [None]:
df

In [None]:
df1 = df.set_index('num')
df1

#### Making an index into a column

In [None]:
df1.reset_index()

### Columns

The column labels are just another `Index` object.

In [None]:
df.columns

### Getting raw values

Sometimes you just want a `numpy` array, and not a `pandas` object.

In [None]:
df.values

## Creating Data Frames

### Manual

In [None]:
from collections import OrderedDict

In [None]:
n = 5
dates = pd.date_range(start='now', periods=n, freq='D')
df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n), 
                              weight=np.random.normal(70, 20, n),
                              height=np.random.normal(170, 15, n),
                              date=dates,
                             ))
df

### From file

You can read in data from many different sources: plain text, JSON, spreadsheets, databases, etc. The reader functions are named `read_X`, where `X` is the file format (for example `read_csv`, `read_json`, `read_excel`, `read_sql`).

In [None]:
%%file measures.txt
pid	weight	height	date
328	72.654347	203.560866	2018-11-11 14:16:18.148411
756	34.027679	189.847316	2018-11-12 14:16:18.148411
185	28.501914	158.646074	2018-11-13 14:16:18.148411
507	17.396343	180.795993	2018-11-14 14:16:18.148411
919	64.724301	173.564725	2018-11-15 14:16:18.148411

In [None]:
df = pd.read_table('measures.txt')
df
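The same tab-separated text can also be parsed without writing a file first. This is a sketch that assumes only the first two rows of `measures.txt`; `parse_dates` converts the `date` column while reading, which saves a later call to `pd.to_datetime`.

```python
import io
import pandas as pd

# Two rows of the tab-separated data above, held in memory.
text = (
    "pid\tweight\theight\tdate\n"
    "328\t72.654347\t203.560866\t2018-11-11 14:16:18.148411\n"
    "756\t34.027679\t189.847316\t2018-11-12 14:16:18.148411\n"
)

# read_csv with sep="\t" does the same job as read_table.
measures = pd.read_csv(io.StringIO(text), sep="\t", parse_dates=["date"])
print(measures.dtypes)
```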

## Indexing Data Frames

### Implicit defaults

If you provide a slice, it is assumed that you are asking for rows.

In [None]:
df[1:3]

If you provide a single value or list, it is assumed that you are asking for columns.

In [None]:
df[['pid', 'weight']]

### Extracting a column

#### Dictionary style access

In [None]:
df['pid']

#### Property style access

This only works for column names that are also valid Python identifiers (i.e., no spaces, dashes or keywords) and that do not clash with an existing `DataFrame` attribute or method.

In [None]:
df.pid
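A short sketch of where property style access breaks down. The frame `df_demo` and its column names are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'pid': [1, 2],
    'count': [10, 20],           # clashes with the DataFrame.count method
    'body weight': [70.0, 80.0], # not a valid Python identifier
})

print(df_demo.pid.tolist())             # attribute access works here
print(callable(df_demo.count))          # True: this is the method, not the column
print(df_demo['count'].tolist())        # dictionary style always reaches the column
print(df_demo['body weight'].tolist())  # the only way to reach this column
```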

### Indexing by location

This is similar to `numpy` indexing

In [None]:
df.iloc[1:3, :]

In [None]:
df.iloc[1:3, 1:4:2]

### Indexing by name

In [None]:
df.loc[1:3, 'weight':'height']

**Warning**: When using `loc`, the row slice refers to row labels, not positions, and unlike `iloc` it includes the end label.

In [None]:
df1 = df.copy()
df1.index = df.index + 1
df1

In [None]:
df1.loc[1:3, 'weight':'height']
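To make the difference concrete, here is a minimal sketch on a made-up frame whose row labels start at 1, so labels and positions disagree:

```python
import pandas as pd

df_demo = pd.DataFrame({'weight': [70, 80, 90, 100]}, index=[1, 2, 3, 4])

# iloc slices by position and excludes the end: positions 1 and 2.
by_position = df_demo.iloc[1:3]
print(by_position.index.tolist())  # [2, 3]

# loc slices by label and includes the end: labels 1, 2 and 3.
by_label = df_demo.loc[1:3]
print(by_label.index.tolist())     # [1, 2, 3]
```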

## Structure of a Data Frame

### Data types

In [None]:
df.dtypes

### Converting data types

#### Using `astype` on one column

In [None]:
df.pid = df.pid.astype('category')

#### Using `astype` on multiple columns

In [None]:
df = df.astype(dict(weight=float, height=float))

#### Using a conversion function

In [None]:
df.date = pd.to_datetime(df.date)

#### Check

In [None]:
df.dtypes

### Basic properties

In [None]:
df.size

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

### Inspection

In [None]:
df.head(n=3)

In [None]:
df.tail(n=3)

In [None]:
df.sample(n=3)

In [None]:
df.sample(frac=0.5)

## Selecting, Renaming and Removing Columns

### Selecting columns

In [None]:
df.filter(items=['pid', 'date'])

In [None]:
df.filter(regex='.*ght')

Note that you can also use regular string methods on the column names.

In [None]:
df.loc[:, df.columns.str.contains('d')]

### Renaming columns

In [None]:
df.rename(dict(weight='w', height='h'), axis=1)

In [None]:
orig_cols = df.columns 

In [None]:
df.columns = list('abcd')

In [None]:
df

In [None]:
df.columns = orig_cols

In [None]:
df

### Removing columns

In [None]:
df.drop(['pid', 'date'], axis=1)

In [None]:
df.drop(columns=['pid', 'date'])

In [None]:
df.drop(columns=df.columns[df.columns.str.contains('d')])

## Selecting, Renaming and Removing Rows

### Selecting rows

In [None]:
df[df.weight.between(60,70)]

In [None]:
df[(69 <= df.weight) & (df.weight < 70)]

In [None]:
df[df.date.between(pd.to_datetime('2018-11-13'), 
                   pd.to_datetime('2018-11-15 23:59:59'))]

### Renaming rows

In [None]:
df.rename({i:letter for i,letter in enumerate('abcde')})

In [None]:
df.index = ['the', 'quick', 'brown', 'fox', 'jumps']

In [None]:
df

In [None]:
df = df.reset_index(drop=True)

In [None]:
df

### Dropping rows

In [None]:
df.drop([1,3], axis=0)

#### Dropping duplicated data

In [None]:
df['something'] = [1,1,None,2,None]

In [None]:
df.loc[df.something.duplicated()]

In [None]:
df.drop_duplicates(subset='something')
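A sketch of what counts as a duplicate, using the same values as the `something` column. By default the first occurrence is kept, and missing values are treated as equal to each other; the `keep='last'` variant is shown only for comparison.

```python
import pandas as pd

df_demo = pd.DataFrame({'something': [1, 1, None, 2, None]})

# True marks a value that has already been seen further up.
mask = df_demo.something.duplicated()
print(mask.tolist())           # [False, True, False, False, True]

# Keep the first occurrence of each value (the default).
deduped = df_demo.drop_duplicates(subset='something')
print(deduped.index.tolist())  # [0, 2, 3]

# Keep the last occurrence instead.
last = df_demo.drop_duplicates(subset='something', keep='last')
print(last.index.tolist())     # [1, 3, 4]
```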

#### Filling or dropping missing data

In [None]:
df

In [None]:
df.something.fillna(0)

In [None]:
df.something.ffill()

In [None]:
df.something.bfill()

In [None]:
df.something.interpolate()

In [None]:
df.dropna()
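The options above differ in what they put in place of a missing value. A sketch on a stand-alone series with the same values as `something`:

```python
import pandas as pd

s = pd.Series([1, 1, None, 2, None])

filled = s.fillna(0)      # replace nulls with a constant
forward = s.ffill()       # carry the previous value forward
backward = s.bfill()      # pull the next value backward; the last null has none
linear = s.interpolate()  # linear interpolation between neighbours
dropped = s.dropna()      # remove the nulls altogether

print(filled.tolist())         # [1.0, 1.0, 0.0, 2.0, 0.0]
print(forward.tolist())        # [1.0, 1.0, 1.0, 2.0, 2.0]
print(linear.iloc[2])          # 1.5
print(dropped.index.tolist())  # [0, 1, 3]
```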

## Transforming and Creating Columns

In [None]:
df.assign(bmi=df['weight'] / (df['height']/100)**2)

In [None]:
df['bmi'] = df['weight'] / (df['height']/100)**2

In [None]:
df

In [None]:
df['something'] = [2,2,None,None,3]

In [None]:
df

## Sorting Data Frames

### Sort on indexes

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=0, ascending=False)

### Sort on values

In [None]:
df.sort_values(by=['something', 'bmi'], ascending=[True, False])

## Summarizing

### Apply an aggregation function

In [None]:
df.select_dtypes(include=np.number)

In [None]:
df.select_dtypes(include=np.number).agg('sum')

In [None]:
df.select_dtypes(include=np.number).agg(['count', 'sum', 'mean'])

## Split-Apply-Combine

We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with `groupby` followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame.
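The three steps can be spelled out by hand to show what `groupby` does for us. This is only a sketch on made-up data; the names `pieces`, `means`, `manual` and `auto` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'treatment': list('ababa'),
    'weight': [60.0, 70.0, 80.0, 100.0, 100.0],
})

# Split: one sub-frame per treatment.
pieces = {name: group for name, group in df_demo.groupby('treatment')}

# Apply: aggregate each piece separately.
means = {name: group['weight'].mean() for name, group in pieces.items()}

# Combine: collect the results into a single object.
manual = pd.Series(means)

# groupby followed by an aggregate does all three steps at once.
auto = df_demo.groupby('treatment')['weight'].mean()
print(auto)
```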

In [None]:
df['treatment'] = list('ababa')

In [None]:
df

In [None]:
grouped = df.groupby('treatment')

In [None]:
grouped.get_group('a')

In [None]:
grouped.mean(numeric_only=True)

### Using `agg` with `groupby`

In [None]:
grouped[['weight', 'height', 'bmi']].agg('mean')

In [None]:
grouped[['weight', 'height', 'bmi']].agg(['mean', 'std'])

In [None]:
grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})

### Using `transform` with `groupby`

In [None]:
g_mean = grouped[['weight', 'height']].transform(np.mean)
g_mean

In [None]:
g_std = grouped[['weight', 'height']].transform(np.std)
g_std

In [None]:
(df[['weight', 'height']] - g_mean)/g_std
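Unlike `agg`, `transform` returns one value per original row, which is what makes the within-group standardization above line up. A minimal sketch on made-up data, using string names for the functions:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'treatment': list('aabb'),
    'weight': [60.0, 80.0, 70.0, 100.0],
})
grouped = df_demo.groupby('treatment')['weight']

# Each row receives the statistic of the group it belongs to.
g_mean = grouped.transform('mean')
g_std = grouped.transform('std')
print(g_mean.tolist())  # [70.0, 70.0, 85.0, 85.0]

# z-scores within each treatment group; same length as the original frame.
z = (df_demo['weight'] - g_mean) / g_std
print(z)
```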

## Combining Data Frames

In [None]:
df

In [None]:
df1 =  df.iloc[3:].copy()

In [None]:
df1.drop('something', axis=1, inplace=True)
df1

### Adding rows

Note that `pandas` aligns by column indexes automatically. `DataFrame.append` was deprecated in `pandas` 1.4 and removed in 2.0, so use `pd.concat` to add rows.

In [None]:
pd.concat([df, df1], sort=False)

### Adding columns

In [None]:
df.pid

In [None]:
df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))

In [None]:
df2.pid

In [None]:
df.pid = df.pid.astype('int')

In [None]:
pd.merge(df, df2, on='pid', how='inner')

In [None]:
pd.merge(df, df2, on='pid', how='left')

In [None]:
pd.merge(df, df2, on='pid', how='right')

In [None]:
pd.merge(df, df2, on='pid', how='outer')
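The four `how` options differ only in which keys survive the merge. A sketch with two made-up frames that share the keys 2 and 3; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'pid': [1, 2, 3], 'weight': [60.0, 70.0, 80.0]})
right = pd.DataFrame({'pid': [2, 3, 4], 'age': [23, 34, 45]})

# Number of rows produced by each kind of join.
sizes = {
    how: len(pd.merge(left, right, on='pid', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(sizes)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# The indicator column labels each row as left_only, right_only or both.
outer = pd.merge(left, right, on='pid', how='outer', indicator=True)
print(outer[['pid', '_merge']])
```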

### Merging on the index

In [None]:
df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))
df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))
df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))

In [None]:
df1

In [None]:
df2

In [None]:
df3

In [None]:
df1.join([df2, df3])

## Fixing common DataFrame issues

### Multiple variables in a column

In [None]:
df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))
df

In [None]:
df.pid_treat.str.split('-')

In [None]:
df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])
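An alternative sketch that gets the same two columns without `apply`: passing `expand=True` to `str.split` returns a data frame directly. The name `parts` is illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame(dict(pid_treat=['A-1', 'B-2', 'C-1', 'D-2']))

# expand=True spreads the pieces over columns instead of returning lists.
parts = df_demo.pid_treat.str.split('-', expand=True)
parts.columns = ['pid', 'treat']
print(parts)
```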

### Multiple values in a cell

In [None]:
df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))
df

In [None]:
df[['t1', 't2', 't3']]  = df.vals.apply(pd.Series)
df

In [None]:
df.drop('vals', axis=1, inplace=True)

In [None]:
pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)
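For this particular case `DataFrame.explode` may be a shorter route than splitting into columns and melting. A sketch on the same data; note that the row order differs from the `melt` result, since `explode` keeps each `pid` together.

```python
import pandas as pd

df_demo = pd.DataFrame(dict(pid=['a', 'b', 'c'],
                            vals=[(1, 2, 3), (4, 5, 6), (7, 8, 9)]))

# Each element of a tuple becomes its own row; pid is repeated alongside.
long = df_demo.explode('vals').reset_index(drop=True)
print(long)
```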

## Reshaping Data Frames

Sometimes we need to make rows into columns or vice versa.

### Converting multiple columns into a single column

This is often useful if you need to condition on some variable.

In [None]:
url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
iris = pd.read_csv(url)

In [None]:
iris.head()

In [None]:
iris.shape

In [None]:
df_iris = pd.melt(iris, id_vars='species')

In [None]:
df_iris.sample(10)

## Pivoting

Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary.

In [None]:
df_iris.pivot_table(index='variable', columns='species', values='value', aggfunc='mean')
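The same melt-then-pivot sequence on a tiny made-up table makes it easier to see what moves where. The values below are invented for illustration and are not taken from the iris data.

```python
import pandas as pd

wide = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica'],
    'sepal_length': [5.0, 5.2, 6.3],
    'petal_length': [1.4, 1.6, 6.0],
})

# Wide to long: one row per (species, variable) measurement.
long = pd.melt(wide, id_vars='species')
print(long.columns.tolist())  # ['species', 'variable', 'value']

# Long to summary: variables as rows, species as columns, mean in each cell.
summary = long.pivot_table(index='variable', columns='species',
                           values='value', aggfunc='mean')
print(summary)
```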

## Functional style - `apply`, `applymap` and `map`

`apply` can be used to apply a custom function

In [None]:
scores = pd.DataFrame(
    np.around(np.clip(np.random.normal(90, 10, (5,3)), 0, 100), 1),
    columns = ['math', 'stat', 'biol'],
    index = ['anne', 'bob', 'charles', 'dirk', 'edgar']
)

In [None]:
scores

In [None]:
def convert_grade_1(score):
    return np.where(score > 90, 'A', 
                    np.where(score > 80, 'B',
                            np.where(score > 70, 'C', 'F')))

In [None]:
scores.apply(convert_grade_1)

The nested `np.where` is a little clumsy; here is an alternative using `pd.cut` and `np.choose`.

In [None]:
def convert_grade_2(score):
    if score.name == 'math': # math professors are mean
        return np.choose(
            pd.cut(score, [-1, 80, 90, 95, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )    
    else:
        return np.choose(
            pd.cut(score, [-1, 70, 80, 90, 100], labels=False),
            ['F', 'C', 'B', 'A']
        )

In [None]:
scores.apply(convert_grade_2)
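A further variation, sketched on a made-up series: `pd.cut` can attach the letter grades itself through its `labels` argument, which removes the need for `np.choose`. The bins are closed on the right, so a score of exactly 70 falls in the lowest bin.

```python
import pandas as pd

score = pd.Series([65.0, 70.0, 75.0, 85.0, 95.0])

# Bins are (-1, 70], (70, 80], (80, 90], (90, 100].
grades = pd.cut(score, bins=[-1, 70, 80, 90, 100],
                labels=['F', 'C', 'B', 'A'])
print(grades.tolist())  # ['F', 'F', 'C', 'B', 'A']
```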

`apply` can be used to avoid explicit looping

In [None]:
def likely_profession(row):
    if (row.biol > row.math) and (row.biol > row.stat):
        return 'farmer'
    elif (row.math > row.biol) and (row.math > row.stat):
        return 'high school teacher'
    elif (row.stat > row.math) and (row.stat > row.biol):
        return 'actuary'
    else:
        return 'doctor'

In [None]:
scores.apply(likely_profession, axis=1)

If all else fails, you can loop over `pandas` data frames.

- Be prepared for pitying looks from more snobbish Python coders

Loops are frowned upon because they are not efficient, but sometimes pragmatism beats elegance.

In [None]:
for idx, row in scores.iterrows():
    print(f'\nidx = {idx}\nrow = {row.index}: {row.values}\n', 
          end='-'*30)
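Before reaching for a loop, it is worth checking whether a vectorized method covers the case. As a sketch, the `likely_profession` rule can be approximated with `idxmax`. One assumption to note: ties go to the first column here, whereas the original function returns `'doctor'` for them. The frame `scores_demo` is made up.

```python
import pandas as pd

scores_demo = pd.DataFrame(
    {'math': [95.0, 70.0, 80.0],
     'stat': [90.0, 85.0, 75.0],
     'biol': [85.0, 80.0, 99.0]},
    index=['anne', 'bob', 'charles'],
)

# Column label of the highest score in each row.
best_subject = scores_demo.idxmax(axis=1)

# Translate the subject into a profession with a dictionary lookup.
profession = best_subject.map({
    'math': 'high school teacher',
    'stat': 'actuary',
    'biol': 'farmer',
})
print(profession)
```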

`apply` can be used for reductions along margins

In [None]:
df = pd.DataFrame(np.random.randint(0, 10, (4,5)), columns=list('abcde'), index=list('wxyz'))

In [None]:
df

In [None]:
df.apply(sum, axis=0)

In [None]:
df.apply(sum, axis=1)

#### For element-wise mapping operations

In [None]:
import string

In [None]:
char_map = {i: c for i,c in enumerate(string.ascii_uppercase)}

In [None]:
# `applymap` is deprecated since pandas 2.1 in favor of `DataFrame.map`
df.applymap(lambda x: char_map[x])

#### For mapping a series

In [None]:
df.assign(b_map = df.b.map(char_map))

## Chaining commands

Sometimes you see this functional style of method chaining that avoids the need for temporary intermediate variables.

In [None]:
(
    iris.
    sample(frac=0.2).
    filter(regex='s.*').
    assign(both=iris.sepal_length + iris.petal_length).
    query('both > 2').
    groupby('species').agg(['mean', 'sum']).
    pipe(lambda x: np.around(x, 1))
)
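One detail worth knowing when chaining: a `lambda` passed to `assign` receives the intermediate data frame, so the new column is computed from the rows that survived the earlier steps rather than from the original variable. A minimal sketch on made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'g': list('aabb'),
    'x': [1, 2, 3, 4],
    'y': [10, 20, 30, 40],
})

result = (
    df_demo
    .query('x > 1')                     # drops the first row
    .assign(total=lambda d: d.x + d.y)  # d is the filtered frame
    .groupby('g')['total']
    .sum()
)
print(result.to_dict())  # {'a': 22, 'b': 77}
```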

## Moving between R and Python in Jupyter

In [None]:
%load_ext rpy2.ipython

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

In [None]:
iris = %R iris

In [None]:
iris.head()

In [None]:
iris_py = iris.copy()
iris_py.Species = iris_py.Species.str.upper()

In [None]:
%%R -i iris_py -o iris_r

iris_r <- iris_py[1:3,]

In [None]:
iris_r

In [None]:
! python3 -m pip install --quiet watermark

In [None]:
%load_ext watermark

In [None]:
%watermark -v -iv