## Pandas in 2 mins
You can't learn Pandas in 2 mins, but here are some of the basics needed for this course.

First, you can define a dict containing credit card payments, labeled as fraud or not-fraud, and create a Pandas DataFrame from it.

In [None]:
import pandas as pd

data = { 
    'credit_card_number': ['1111 2222 3333 4444', '1111 2222 3333 4444','1111 2222 3333 4444',
                           '1111 2222 3333 4444'],
    'trans_datetime': ['2022-01-01 08:44', '2022-01-01 19:44', '2022-01-01 20:44', '2022-01-01 20:55'],
    'amount': [142.34, 12.34, 66.29, 112.33],
    'location': ['Sao Paulo', 'Rio De Janeiro', 'Stockholm', 'Stockholm'],
    'fraud': [False, False, True, True] 
}

df = pd.DataFrame.from_dict(data)
df

In [None]:
df.info()

In [None]:
# Convert trans_datetime to datetime64 and check the column dtypes
df['trans_datetime'] = pd.to_datetime(df['trans_datetime'])
df.info()

### Lambda functions

We will now apply a lambda function to the column `amount` and save the result in a new column `is_big` in our DataFrame `df`.

In [None]:
df['is_big'] = df['amount'].apply(lambda amount: amount > 100)
df
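
As a minimal sketch, the same column can also be computed without `apply`, because comparing a Series with a scalar is already element-wise. The small DataFrame `df_demo` is a stand-in that reuses the `amount` values from above, and the column name `is_big_vectorized` is only illustrative.

```python
import pandas as pd

# Stand-in data: the same amounts as in the example above.
df_demo = pd.DataFrame({'amount': [142.34, 12.34, 66.29, 112.33]})

# Element-wise lambda, as in the cell above.
df_demo['is_big'] = df_demo['amount'].apply(lambda amount: amount > 100)

# Equivalent vectorized comparison (illustrative column name).
df_demo['is_big_vectorized'] = df_demo['amount'] > 100

print(df_demo)
```

Both columns hold the same boolean values; the lambda form is mainly useful when the logic cannot be written as a single column expression.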

### Apply and UDFs

We will now apply a user-defined function (UDF), `is_small`, to each row of the DataFrame `df`.  
The result is a Series that we store in a new column of `df` called `is_small`.

In [None]:
def is_small(row):
    return row['amount'] < 100
    
df['is_small'] = df.apply(is_small, axis=1)
df
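
A row-wise UDF becomes more useful when the logic reads several columns at once. The sketch below assumes a hypothetical rule, `is_small_in_stockholm`, that is not part of the course material, and compares it with the equivalent boolean mask built from whole columns.

```python
import pandas as pd

# Stand-in data reusing two columns from the example above.
df_demo = pd.DataFrame({
    'amount': [142.34, 12.34, 66.29, 112.33],
    'location': ['Sao Paulo', 'Rio De Janeiro', 'Stockholm', 'Stockholm'],
})

def is_small_in_stockholm(row):
    # Hypothetical rule: a payment under 100 made in Stockholm.
    return row['amount'] < 100 and row['location'] == 'Stockholm'

# axis=1 passes each row to the function as a Series.
df_demo['udf_result'] = df_demo.apply(is_small_in_stockholm, axis=1)

# The same rule as a vectorized mask: combine column comparisons with &.
df_demo['vectorized_result'] = (df_demo['amount'] < 100) & (df_demo['location'] == 'Stockholm')

print(df_demo)
```

Note that the UDF uses Python's `and` on scalar values, while the vectorized form needs `&` with parentheses around each comparison.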

## Rolling Windows

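We will compute a rolling maximum of `amount` over a 1-day time window. Time-based windows need a datetime index, so we first set `trans_datetime` as the index.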

In [None]:
df_rolling = df.set_index('trans_datetime')
df_rolling

In [None]:
df_rolling['rolling_max_1d'] = df_rolling.rolling('1D').amount.max()
df_rolling
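
To see how the window length changes the result, here is a small self-contained sketch on the same four transactions. The 120-minute window and the names `tx`, `rolling_max_120min` and `rows_in_120min` are assumptions chosen for illustration.

```python
import pandas as pd

# Stand-in data: the amounts and timestamps from the example above.
tx = pd.DataFrame(
    {'amount': [142.34, 12.34, 66.29, 112.33]},
    index=pd.to_datetime(['2022-01-01 08:44', '2022-01-01 19:44',
                          '2022-01-01 20:44', '2022-01-01 20:55']),
)

# All four rows fall within one day, so the 1-day maximum is the first
# (largest) amount for every row.
tx['rolling_max_1d'] = tx['amount'].rolling('1D').max()

# A shorter window only looks back 120 minutes from each row's timestamp.
tx['rolling_max_120min'] = tx['amount'].rolling('120min').max()

# Number of rows that fall inside each 120-minute window.
tx['rows_in_120min'] = tx['amount'].rolling('120min').count()

print(tx)
```

By default a time-based window includes the current row and excludes the row exactly one window length earlier.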

Let's create a new DataFrame, `df2`, with new data.

In [None]:
import numpy as np

# 100k rows, matching the timing comparison below
df2 = pd.DataFrame({
    'a': np.random.randint(1, 100, size=100000),
    'b': np.random.randint(100, 1000, size=100000),
    'c': np.random.random(100000)
})
df2.shape
(100000, 3)

### Vectorized operations are faster than "apply" with UDFs

We will see that `apply` is roughly 50 times slower than the equivalent vectorized operation on 100k rows. The exact ratio depends on your machine and Pandas version.

In [None]:
%%timeit
df2['a'].apply(lambda x: x**2)

This vectorized operation is much faster:

In [None]:
%%timeit
df2['a'] ** 2
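
Outside a notebook, where the `%%timeit` magic is not available, the same comparison can be sketched with the standard `timeit` module. The Series `s`, the seed and the repetition count are arbitrary choices for illustration, and the measured times will vary between machines.

```python
import timeit

import numpy as np
import pandas as pd

# 100k random integers, similar to column 'a' above.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(1, 100, size=100_000))

# Total time for 5 runs of each approach.
t_apply = timeit.timeit(lambda: s.apply(lambda x: x ** 2), number=5)
t_vectorized = timeit.timeit(lambda: s ** 2, number=5)

print(f"apply:      {t_apply:.4f} s")
print(f"vectorized: {t_vectorized:.4f} s")
print(f"ratio:      {t_apply / t_vectorized:.1f}x")
```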

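A few other methods give a quick overview of a DataFrame: `describe()` summarizes the numeric columns, `unique()` and `nunique()` list and count distinct values, and `isnull().sum()` counts missing values per column.

In [None]:
df2.describe()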

In [None]:
df.trans_datetime.unique()

In [None]:
df.credit_card_number.nunique()

In [None]:
df.isnull().sum()

## Transformations

We generate a univariate data sample with a long right tail and plot its histogram.
The NumPy random number generator is seeded so that the sample is reproducible.

In [None]:
import seaborn as sns

from numpy.random import seed, randn, rand
from numpy import append

seed(1)                        # fix the seed so the sample is reproducible
array = 5 * randn(100) + 10    # 100 Gaussian values (mean 10, std 5)
tail = 10 + (rand(50) * 100)   # 50 uniform values in [10, 110) that form the long tail
array = append(array, tail)
sns.histplot(array)

In [None]:
columns = ['amount']
df_exp = pd.DataFrame(data = array, columns = columns)
  
df_exp.describe()

In [None]:
df_exp

## Min-Max Scaler in Vectorized Pandas

This is an efficient way to rescale our input Pandas column to the range [0.0, 1.0].

In [None]:
# Min-Max Normalization in Pandas
df_norm = (df_exp-df_exp.min())/(df_exp.max()-df_exp.min())
df_norm.head()
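
The formula can be checked on a small synthetic sample. This is a sketch under assumptions: the exponential data and the names `amounts`, `col_min`, `col_max` and `new_amounts` are made up for illustration. It also shows that values scaled later with the stored minimum and maximum can fall outside [0.0, 1.0].

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed sample (assumption: exponential, scale 50).
rng = np.random.default_rng(1)
amounts = pd.DataFrame({'amount': rng.exponential(scale=50.0, size=1000)})

# Store the statistics so the same scaling can be reused later.
col_min = amounts.min()
col_max = amounts.max()

# Min-max normalization: x' = (x - min) / (max - min).
amounts_norm = (amounts - col_min) / (col_max - col_min)
print(amounts_norm['amount'].min(), amounts_norm['amount'].max())

# New values scaled with the stored statistics are not clipped to [0, 1].
new_amounts = pd.DataFrame({'amount': [0.0, 25.0, 1000.0]})
new_norm = (new_amounts - col_min) / (col_max - col_min)
print(new_norm)
```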

In [None]:
sns.histplot(df_norm)

## Power Transformer in Scikit-Learn

Scikit-Learn provides many different transformers.
For heavy-tailed distributions, it is often recommended to perform a [power transformation](https://towardsdatascience.com/how-to-differentiate-between-scaling-normalization-and-log-transformations-69873d365a94).

As the histogram shows, this produces a more Gaussian (normal) distribution than min-max scaling.

In [None]:
from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()

df_power = pd.DataFrame(
    pt.fit_transform(df_exp[["amount"]]), columns=["amount"]
)

sns.histplot(df_power)
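
Besides the histogram, the effect can be inspected numerically. The sketch below builds a similar long-tailed sample (using NumPy's `default_rng`, so the values differ from the ones above) and relies on the `PowerTransformer` defaults, `method='yeo-johnson'` and `standardize=True`. The names `sample`, `df_tail` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Gaussian bulk plus a uniform long tail, similar to the sample above.
rng = np.random.default_rng(1)
sample = np.append(5 * rng.standard_normal(100) + 10, 10 + rng.random(50) * 100)
df_tail = pd.DataFrame({'amount': sample})

# Defaults: Yeo-Johnson transform followed by standardization.
pt = PowerTransformer()
transformed = pd.DataFrame(pt.fit_transform(df_tail[['amount']]), columns=['amount'])

# The fitted exponent and the skewness before and after the transform.
print('fitted lambda:', pt.lambdas_)
print('skew before:  ', df_tail['amount'].skew())
print('skew after:   ', transformed['amount'].skew())

# The transform is invertible, so the original amounts can be recovered.
restored = pt.inverse_transform(transformed.to_numpy())
```

Because the output is standardized, the transformed column has mean close to 0 and standard deviation close to 1.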