<a href="https://colab.research.google.com/github/aaron-abrams-uva/DS1002-S24/blob/main/Pandas/Review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas review notebook
Created by Prof. Neal Magee

## Pandas DataFrame: Create from lists of values

In [None]:
import pandas as pd

last_names = ['Connor', 'Connor', 'Reese']
first_names = ['Sarah', 'John', 'Kyle']
df = pd.DataFrame({
  'first_name': first_names,
  'last_name': last_names,
})
df

## Pandas DataFrame: Delete Column(s)

In [None]:
import pandas as pd

last_names = ['Connor', 'Connor', 'Reese']
first_names = ['Sarah', 'John', 'Kyle']
ages = [42, 46, 50]
df = pd.DataFrame({
  'first_name': first_names,
  'last_name': last_names,
  'ages': ages
})

df = df.drop(['first_name', 'ages'], axis=1)
df

## Pandas DataFrame: Sort by specified columns

In [None]:
import pandas as pd

last_names = ['Connor', 'Reese', 'Connor']
first_names = ['Sarah', 'Kyle', 'John']
df = pd.DataFrame({
  'first_name': first_names,
  'last_name': last_names,
})

df = df.sort_values(["last_name", "first_name"], ascending=True)
df

## Pandas DataFrame: Rename multiple Columns

In [None]:
import pandas as pd
df = pd.DataFrame({
    'Year': [2016, 2015, 2014, 2013, 2012],
    'Top Animal': ['Giant panda', 'Chicken', 'Pig', 'Turkey', 'Dog']
})

df.rename(columns={
    'Year': 'Calendar Year',
    'Top Animal': 'Favorite Animal',
}, inplace=True)
df

## Pandas DataFrame: Query by regexp (regular expression)

In [None]:
import pandas as pd
df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle', 'Joe'],
  'last_name': ['Connor', 'Connor', 'Reese', 'Bonnot'],
})

df[df.last_name.str.match('.*onno.*')]

## Pandas DataFrame: Query by variable value

Evaluate a variable as the value to find.

In [None]:
import pandas as pd
df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})

foo = 'Connor'
df.query('last_name == @foo')

## Pandas DataFrame: Query using variable value as a column name

Evaluate a variable, to use its value as the name of a column in a query.

E.g. Query for rows where `John` is the value in the column named `first_name`.

In [None]:
import pandas as pd
df = pd.DataFrame(data={
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})

column_name = 'first_name'
df.query(f"`{column_name}` == 'John'")
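
For comparison, here is a minimal sketch of the same lookup written with a boolean mask instead of `query()`. The names `via_query` and `via_mask` are illustrative only; the mask form avoids building a query string, which may be easier to read when the column name comes from a variable.

```python
import pandas as pd

df = pd.DataFrame(data={
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})

column_name = 'first_name'

# Query-string form: backticks quote the column name inside the expression.
via_query = df.query(f"`{column_name}` == 'John'")

# Boolean-mask form: index the column by name, then filter with the mask.
via_mask = df[df[column_name] == 'John']

via_mask
```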

## Pandas DataFrame: Query by Timestamp above a value

In [None]:
import pandas as pd
df = pd.DataFrame({
  'time': ['2022-09-14 00:52:00-07:00', '2022-09-14 00:52:30-07:00',
           '2022-09-14 01:52:30-07:00'],
  'letter': ['A', 'B', 'C'],
})
df['time'] = pd.to_datetime(df.time)

df.query('time >= "2022-09-14 00:52:30-07:00"')

## Pandas DataFrame: Query for Timestamp between two values

In [None]:
import pandas as pd
df = pd.DataFrame({
  'time': ['2022-09-14 00:52:00-07:00', '2022-09-14 00:52:30-07:00',
           '2022-09-14 01:52:30-07:00'],
  'letter': ['A', 'B', 'C'],
})
df['time'] = pd.to_datetime(df.time)

begin_ts = '2022-09-14 00:52:00-07:00'
end_ts = '2022-09-14 00:54:00-07:00'

df.query('@begin_ts <= time < @end_ts')

## Pandas DataFrame: Filter by Timestamp in DatetimeIndex using `.loc[]`

In [None]:
import pandas as pd
df = pd.DataFrame({
  'time': ['2022-09-14 00:52:00-07:00', '2022-09-14 00:52:30-07:00',
           '2022-09-14 01:52:30-07:00'],
  'letter': ['A', 'B', 'C'],
})
df['time'] = pd.to_datetime(df.time)
df.set_index('time', inplace=True)

df.loc['2022-09-14':'2022-09-14 00:53']

## Pandas DataFrame: Filter by Timestamp using TimeDelta string

In [None]:
import pandas as pd
df = pd.DataFrame({
  'time': ['2022-09-14 00:52:00-07:00', '2022-09-14 00:52:30-07:00',
           '2022-09-14 01:52:30-07:00'],
  'letter': ['A', 'B', 'C'],
})
df['time'] = pd.to_datetime(df.time)

def rows_in_time_range(df, time_column, start_ts_str, timedelta_str):
  # Return rows from df, where start_ts <= time_column < start_ts + delta.
  # start_ts_str must be timezone-naive, e.g. a date '2022-09-01' or a time
  # '2022-09-14 00:52:00'; it is localized to US/Pacific below.
  # timedelta_str examples: '2 minutes'  '2 days 2 hours 15 minutes 30 seconds'
  start_ts = pd.Timestamp(start_ts_str).tz_localize('US/Pacific')
  end_ts = start_ts + pd.to_timedelta(timedelta_str)
  return df.query("@start_ts <= {0} < @end_ts".format(time_column))

rows_in_time_range(df, 'time', '2022-09-14 00:00', '52 minutes 31 seconds')

## Pandas: Describe Timestamp values in a column

In [None]:
import pandas as pd
df = pd.DataFrame({
  'time': ['2022-09-14 00:52:00-07:00', '2022-09-14 00:52:30-07:00',
           '2022-09-14 01:52:30-07:00'],
  'letter': ['A', 'B', 'C'],
})
df['time'] = pd.to_datetime(df.time)

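# pandas 2.0 removed the datetime_is_numeric argument; datetime columns
# are now always summarized numerically (min, max, mean, quantiles).
df['time'].describe()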

## Pandas DataFrame: Explode a column containing dictionary values into multiple columns

This code transforms or splits the dictionary column into many columns.

E.g. The output DataFrame of this cell will have columns named `date`, `letter`, `fruit`, and `weather`.

In [None]:
import pandas as pd
df = pd.DataFrame({
  'date': ['2022-09-14', '2022-09-15', '2022-09-16'],
  'letter': ['A', 'B', 'C'],
  'dict' : [{ 'fruit': 'apple', 'weather': 'aces'},
            { 'fruit': 'banana', 'weather': 'bad'},
            { 'fruit': 'cantaloupe', 'weather': 'cloudy'}],
})

pd.concat([df.drop(['dict'], axis=1), df['dict'].apply(pd.Series)], axis=1)
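
As an alternative sketch, `pd.json_normalize()` can build the new columns directly from the list of dictionaries. This assumes the DataFrame keeps its default integer index, so that `join()` lines the rows up correctly; the name `expanded` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
  'date': ['2022-09-14', '2022-09-15', '2022-09-16'],
  'letter': ['A', 'B', 'C'],
  'dict' : [{ 'fruit': 'apple', 'weather': 'aces'},
            { 'fruit': 'banana', 'weather': 'bad'},
            { 'fruit': 'cantaloupe', 'weather': 'cloudy'}],
})

# Turn the list of dicts into its own DataFrame (one column per key),
# then attach it to the remaining columns by index position.
expanded = df.drop(columns='dict').join(pd.json_normalize(df['dict'].tolist()))
expanded
```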

## Pandas DataFrame: Extract values using regexp (regular expression)

In [None]:
import pandas as pd
df = pd.DataFrame({
  'request': ['GET /index.html?baz=3', 'GET /foo.html?bar=1'],
})

# Use a raw string so that \? reaches the regex engine as an escaped '?'.
df['request'].str.extract(r'GET /([^?]+)\?', expand=True)

## Pandas Timestamp: Convert string to Timestamp, using date only

I.e. Midnight on the given date.

In [None]:
import pandas as pd

pd.Timestamp('9/27/22').tz_localize('US/Pacific')

## Pandas Timestamp: Convert string to Timestamp

In [None]:
import pandas as pd

pd.Timestamp('9/27/22 06:59').tz_localize('US/Pacific')

## Pandas: Create a TimeDelta using `unit`

Builds a TimeDelta from an integer.
`unit` is a string, defaulting to `ns`. Common values include `W`, `D`, `h`, `min`, `s`, `ms`, `us`, and `ns`.


In [None]:
import pandas as pd

pd.to_timedelta(1, unit='h')

## Pandas: Create a TimeDelta using available kwargs

Example keyword arguments: `days`, `seconds`, `microseconds`, `milliseconds`, `minutes`, `hours`, `weeks`

In [None]:
import pandas as pd

pd.Timedelta(days=2)

## Pandas: Create a TimeDelta from a string

In [None]:
import pandas as pd

pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
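
The three ways of building a TimeDelta shown so far (a `unit`, keyword arguments, or a string) should all describe the same duration. A small sketch to check that; the variable names are illustrative only.

```python
import pandas as pd

# From a descriptive string.
from_string = pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')

# From keyword arguments.
from_kwargs = pd.Timedelta(days=2, hours=2, minutes=15, seconds=30)

# From integers plus a unit, added together.
from_units = (pd.to_timedelta(2, unit='D') + pd.to_timedelta(2, unit='h')
              + pd.to_timedelta(15, unit='min') + pd.to_timedelta(30, unit='s'))

# Total length of the duration in seconds.
total = from_string.total_seconds()

from_string == from_kwargs == from_units
```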

## Pandas: Replace NaN values in a Column

In [None]:
import numpy as np
import pandas as pd
df = pd.DataFrame({
  'dogs': [5, 10, np.nan, 7],
})

# fillna() is the idiomatic way to replace missing values; no regex is involved.
df['dogs'].fillna(0)

## Pandas DataFrame: Drop rows with duplicate index values

In [None]:
import pandas as pd
df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle', 'Joe'],
  'last_name': ['Connor', 'Connor', 'Reese', 'Bonnot'],
})
df.set_index('last_name', inplace=True)

df.loc[~df.index.duplicated(), :]

## Pandas DataFrame: Ignore one Column

In [None]:
import pandas as pd
df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle', 'Joe'],
  'last_name': ['Connor', 'Connor', 'Reese', 'Bonnot'],
})

df.loc[:, df.columns!='last_name']

## Pandas DataFrame: Intersect Indexes

In [None]:
import pandas as pd
terminator_df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})
terminator_df.set_index('first_name', inplace=True)

buckaroo_df = pd.DataFrame({
  'first_name': ['John', 'John', 'Buckaroo'],
  'last_name': ['Parker', 'Whorfin', 'Banzai'],
})
buckaroo_df.set_index('first_name', inplace=True)

terminator_df.index.intersection(buckaroo_df.index).shape

## Pandas DataFrame: Select all rows from A that are not in B, using the index

In [None]:
import pandas as pd
terminator_df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})
terminator_df.set_index('first_name', inplace=True)

buckaroo_df = pd.DataFrame({
  'first_name': ['John', 'John', 'Buckaroo'],
  'last_name': ['Parker', 'Whorfin', 'Banzai'],
})
buckaroo_df.set_index('first_name', inplace=True)

terminator_df[~terminator_df.index.isin(buckaroo_df.index)]

## Pandas DataFrame: Select rows by an attribute of a column value

Use the Series `map()` method.
E.g. To filter by the length of a column's values:

In [None]:
import pandas as pd
df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})

df[df['last_name'].map(len) == 5]
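
For string columns specifically, the `.str` accessor offers the same length test without `map()`. A short sketch comparing the two; `via_map` and `via_str` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})

# map() applies any Python function to each value.
via_map = df[df['last_name'].map(len) == 5]

# .str.len() is the vectorized string equivalent.
via_str = df[df['last_name'].str.len() == 5]

via_str
```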

## Pandas DataFrame: Sort the count of rows grouped on columns

In [None]:
import pandas as pd
df = pd.DataFrame({
  'first_name': ['Sarah', 'John', 'Kyle'],
  'last_name': ['Connor', 'Connor', 'Reese'],
})

df.groupby(['last_name']).size().sort_values(ascending=False)

## Pandas DataFrame: Reshape to have 1 row per value in a list column

Creates a new DataFrame that is a transformed version of the input. E.g.
*   Input: df with a column named `msg_ids` that is a list of values (i.e. many per row, at least in some rows).
*   Output: new_df which has 1 row per unique value found in any of the original `msg_ids` lists, with that value in a new column named `msg_id`.


In [None]:
import pandas as pd
df = pd.DataFrame({
  'date': ['9/1/22', '9/2/22', '9/3/22'],
  'action': ['Add', 'Update', 'Delete'],
  'msg_ids': [[1, 2, 3], [], [2, 3]],
})
df.set_index('date', inplace=True)


# explode() emits one row per list element; empty lists become NaN, so drop those.
temp_series = df['msg_ids'].explode().dropna()
temp_series.name = 'msg_id'
new_df = temp_series.to_frame()
new_df.set_index('msg_id', inplace=True)
new_df.loc[~new_df.index.duplicated(), :] # Drop duplicates.

## Pandas DataFrame: Create New Column based on difference between successive rows

Creates a new column holding the difference between each row's value in a specified column and the value in the previous row. Useful for integers and floats to show change over time (+/-).

Note the first row will result in `NaN` since it has nothing to compare to.

In [None]:
import pandas as pd
df = pd.DataFrame({
    'date': ['2023-10-22', '2023-10-21', '2023-10-20', '2023-10-19'],
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'close': [181.13, 184.05, 186.14, 182.23]
})

df['close_diff'] = df['close'].diff()
df
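
Because `diff()` subtracts the previous row from the current one, row order matters. The sample data lists the newest date first, so a sketch like the following, which sorts oldest-first before calling `diff()`, may be closer to what you want when reading the result as day-over-day change. The name `chronological` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-10-22', '2023-10-21', '2023-10-20', '2023-10-19'],
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'close': [181.13, 184.05, 186.14, 182.23]
})

# ISO-formatted date strings sort correctly as plain text.
chronological = df.sort_values('date').reset_index(drop=True)

# Each value is now "this day's close minus the previous day's close".
chronological['close_diff'] = chronological['close'].diff()
chronological
```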

## Pandas DataFrame: Create New Column based on percentage rate of change between successive rows

As with the previous entry, you can express the difference between successive rows as a rate of change rather than an absolute +/- amount. `pct_change()` returns a fraction (e.g. `0.05` for a 5% increase), and the first row is `NaN` since it has nothing to compare to.

In [None]:
import pandas as pd
df = pd.DataFrame({
    'date': ['2023-10-22', '2023-10-21', '2023-10-20', '2023-10-19'],
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL'],
    'close': [181.13, 184.05, 186.14, 182.23]
})

df['rate_change'] = df['close'].pct_change()
df
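
To make the calculation concrete, here is a minimal sketch that reproduces `pct_change()` by hand using `shift()`, then scales the result to a percentage. The names `rate`, `manual`, and `percent` are illustrative only.

```python
import pandas as pd

close = pd.Series([181.13, 184.05, 186.14, 182.23], name='close')

# Built-in: fractional change relative to the previous row.
rate = close.pct_change()

# The same calculation by hand: (current - previous) / previous.
previous = close.shift(1)
manual = (close - previous) / previous

# Scale to a percentage and round for display.
percent = (rate * 100).round(2)
percent
```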

## Pandas DataFrame: Group Timeseries by Frequency

You can group timestamped data into intervals of arbitrary duration using a Grouper object to specify groupby instructions. The `freq` parameter is a string that may contain an integer followed by an [offset alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases). E.g. To see output for 2-minute intervals:

In [None]:
import pandas as pd
df = pd.DataFrame({
  'time': ['2022-09-01 00:00:01-07:00', '2022-09-01 00:00:02-07:00',
           '2022-09-01 00:01:00-07:00', '2022-09-01 00:02:00-07:00',
           '2022-09-01 00:03:00-07:00', '2022-09-01 00:04:00-07:00',
           '2022-09-01 00:05:00-07:00', '2022-09-01 00:07:00-07:00'],
  'requests': [1, 1, 1, 1, 1, 1, 1, 1],
})
df['time'] = pd.to_datetime(df.time)

df.groupby(pd.Grouper(key='time', freq='2min')).sum()
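
Changing `freq` changes the bin width. The sketch below regroups the same data into 5-minute intervals and also shows `resample()` with its `on` argument, which is another way to express the same grouping. The names `by_grouper` and `by_resample` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
  'time': ['2022-09-01 00:00:01-07:00', '2022-09-01 00:00:02-07:00',
           '2022-09-01 00:01:00-07:00', '2022-09-01 00:02:00-07:00',
           '2022-09-01 00:03:00-07:00', '2022-09-01 00:04:00-07:00',
           '2022-09-01 00:05:00-07:00', '2022-09-01 00:07:00-07:00'],
  'requests': [1, 1, 1, 1, 1, 1, 1, 1],
})
df['time'] = pd.to_datetime(df.time)

# Group with a Grouper object, using 5-minute bins.
by_grouper = df.groupby(pd.Grouper(key='time', freq='5min')).sum()

# resample() with on= names the timestamp column directly.
by_resample = df.resample('5min', on='time').sum()

by_grouper
```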

## Pandas: Importing File Types

Pandas has methods available to import a variety of file types, whether local or remote. Remember that CSV data files can use a variety of separating characters beyond the comma.

- CSV: `pd.read_csv("path/to/file.csv")`. With CSV files you can specify a separator `,`, `;`, `|`, etc. by using the `sep=""` argument.
- JSON: `pd.read_json("path/to/file.json")`
- XML: `pd.read_xml("path/to/file.xml")`
- Excel: `pd.read_excel("path/to/file.xlsx")`
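
The readers also accept file-like objects, so a self-contained sketch can use in-memory text instead of files on disk. The sample data and variable names below are made up for illustration; `read_xml` and `read_excel` are left out because they depend on optional packages.

```python
import io
import pandas as pd

# Hypothetical sample data, held in memory rather than in files.
csv_text = "artist,album\nPrince,Purple Rain\nBeyonce,Lemonade\n"
psv_text = csv_text.replace(",", "|")
json_text = ('[{"artist": "Prince", "album": "Purple Rain"}, '
             '{"artist": "Beyonce", "album": "Lemonade"}]')

# Comma is the default separator.
from_csv = pd.read_csv(io.StringIO(csv_text))

# Any other separator is passed with sep=.
from_psv = pd.read_csv(io.StringIO(psv_text), sep="|")

# A JSON array of records becomes one row per record.
from_json = pd.read_json(io.StringIO(json_text))

from_json
```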

In [None]:
# Comma-separated values (the default separator)

import pandas as pd

# read_csv is a function of the pandas module, not a DataFrame method.
df = pd.read_csv("albums.csv")

In [None]:
# Pipe-separated values

import pandas as pd

df = pd.read_csv("albums.psv", sep="|")

In [None]:
# Semicolon-separated values

import pandas as pd

df = pd.read_csv("albums.csv", sep=";")