# Intro to Pandas
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec12_live.ipynb) instead. 
If you don't want to type and want to follow along just by executing the cells, stay in this notebook.

In [None]:
# Standard imports
import numpy as np
import pandas as pd

### Pandas Series

In [None]:
# Create a series from a dict (keys become indices)
dct = {'one': 1, 'two': 2,'three':3}
dct_series = pd.Series(dct)
dct_series

### Accessing Elements

In [None]:
dct_series

In [None]:
# Access elements by their index (bracket notation)
dct_series['one']

In [None]:
# Access elements by their index (dot/attribute notation)
dct_series.one

In [None]:
# Accessing elements by their position
# (plain dct_series[0] is deprecated since pandas 2.1; .iloc is the explicit form)
dct_series.iloc[0]

In [None]:
# Slicing by index (inclusive stop)
dct_series['one':'three']

In [None]:
# Slicing by position (exclusive stop)
dct_series[0:2]

In [None]:
# These different ways of accessing elements can get confusing if indices are ints
new_series = pd.Series({1:1,2:2,3:3,4:4,5:5})
new_series

In [None]:
# Are we accessing by index or position here?
new_series[1]

In [None]:
# Are we slicing by index or position here?
new_series[1:3]

Be explicit with `.loc` (for index-based access) and `.iloc` (for position-based access)

In [None]:
# Access by index
new_series.loc[1]

In [None]:
# Access by position
new_series.iloc[1]

In [None]:
# Slice by index
new_series.loc[1:3]

In [None]:
# Slice by position
new_series.iloc[1:3]
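
To answer the two questions above directly: with integer labels, a plain `new_series[1]` is treated as a label lookup, while a plain `new_series[1:3]` is, as far as I know, still sliced by position. The sketch below rebuilds `new_series` and lines up the explicit forms side by side. The variable names on the left are my own, not part of the lecture.

```python
import pandas as pd

new_series = pd.Series({1: 1, 2: 2, 3: 3, 4: 4, 5: 5})

# Single element: label 1 is the first row, position 1 is the second row
by_label = new_series.loc[1]      # value stored under label 1
by_position = new_series.iloc[1]  # value stored at position 1 (label 2)

# Slices: .loc includes the stop label, .iloc excludes the stop position
label_slice = new_series.loc[1:3]      # labels 1, 2, 3
position_slice = new_series.iloc[1:3]  # positions 1, 2 (labels 2, 3)

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```

Because the labels here start at 1 and positions start at 0, the two accessors never agree, which makes this series a handy test case.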

### Advanced Indexing

In [None]:
# Select multiple elements by index
new_series.loc[[1,3]]

In [None]:
# Select multiple elements by position
new_series.iloc[[1,3]]

In [None]:
# Supports Boolean indexing
new_series[new_series % 2 == 0]

In [None]:
# Conditional indexing
new_series[(new_series > 2) & (new_series < 5)]

In [None]:
# Another way to do conditional indexing (specify inclusive as 'both','neither','left', or 'right')
new_series[new_series.between(2,5,inclusive='neither')]

In [None]:
# Select specific elements
new_series[(new_series == 1) | (new_series == 4)]

In [None]:
# Another way to select specific elements
new_series[new_series.isin([1,4])]
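
All of the conditional selections above work the same way: the comparison produces a Boolean Series (a mask), and indexing with that mask keeps the rows where it is `True`. The sketch below makes the mask visible and shows why the conditions are combined with `&` and parentheses rather than Python's `and`. The names `mask`, `selected`, and `used_and_ok` are illustrative only.

```python
import pandas as pd

new_series = pd.Series({1: 1, 2: 2, 3: 3, 4: 4, 5: 5})

# A comparison returns a Boolean Series with the same index
mask = (new_series > 2) & (new_series < 5)
print(mask)

# Indexing with the mask keeps only the rows where it is True
selected = new_series[mask]
print(selected)

# Python's `and` asks for a single True/False for the whole Series,
# which pandas refuses to guess, so it raises a ValueError
try:
    (new_series > 2) and (new_series < 5)
    used_and_ok = True
except ValueError:
    used_and_ok = False
print(used_and_ok)
```

The parentheses matter because `&` binds more tightly than `>` and `<`. Without them, Python would try to evaluate `2 & new_series` first.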

### General Series Info

In [None]:
# general info about the series
dct_series.info()

### Creating Pandas DataFrames

In [None]:
my_dict = {'fruit':['apple','banana','orange'],
          'color':['red','yellow','orange'],
          'yum_score':[5,5,5],
          'in fridge':[True, False, True]}

In [None]:
# dataframe from a dictionary (treats each key as a column)
fruit_df = pd.DataFrame(my_dict)
fruit_df

In [None]:
# dataframe from a dictionary (specify row labels)
fruit_df = pd.DataFrame(my_dict,index = np.arange(1,4))
fruit_df

In [None]:
# To change column labels
fruit_df.columns = ['fruit','color','yum_score','in_fridge']
fruit_df

In [None]:
# nested lists (or lists of tuples)
lists = [[i,i**2,i**3] for i in range(10)]
lists

In [None]:
# dataframe from a list of lists (treats each sublist as a row)
pd.DataFrame(lists)

In [None]:
# specify column names 
pd.DataFrame(lists,columns = ['n','squared','cubed'])

In [None]:
# list of dictionaries (with same keys)
list_of_dicts = [
    {'Median Home Price': 454000, 'Town': 'Arcata'},
    {'Median Home Price': 383000, 'Town': 'Eureka'},
]
list_of_dicts

In [None]:
# Dataframe from list of dictionaries (treats each dict as a row)
pd.DataFrame(list_of_dicts)
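
The example above uses dictionaries that share the same keys. As a small aside, here is a sketch of what happens when they do not: pandas uses every key it sees as a column and fills the gaps with `NaN`. The `records` list is hypothetical data made up for this illustration.

```python
import pandas as pd

# Hypothetical records: the second one has no 'Median Home Price' key
records = [
    {'Town': 'Arcata', 'Median Home Price': 454000},
    {'Town': 'Eureka'},
]
towns = pd.DataFrame(records)
print(towns)

# The missing entry shows up as NaN
print(towns['Median Home Price'].isna())
```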

### Accessing DataFrame Elements

In [None]:
# accessing columns (dot notation)
fruit_df.color

In [None]:
# accessing columns (bracket notation)
fruit_df['color']
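
Dot notation is convenient, but it only works when the column label is a valid Python name that does not clash with an existing DataFrame attribute. A label such as `in fridge` cannot be written after a dot at all, which is presumably one reason it was renamed to `in_fridge` earlier. The sketch below uses a small made-up frame (the column names are my own) to show the clash case.

```python
import pandas as pd

# Hypothetical column names chosen to trigger both problems
df = pd.DataFrame({'in fridge': [True, False, True],
                   'size': ['small', 'medium', 'large']})

# Bracket notation works for any column label
fridge_col = df['in fridge']
size_col = df['size']

# Dot notation finds the DataFrame attribute `size` (number of cells), not the column
n_cells = df.size
print(n_cells)
print(size_col)
```

When in doubt, bracket notation is the safer choice.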

In [None]:
# accessing rows (by label)
fruit_df.loc[2]

In [None]:
# accessing rows (by position)
fruit_df.iloc[2]

In [None]:
# accessing elements (by label)
fruit_df.loc[2,'color']

In [None]:
# accessing elements (by position)
fruit_df.iloc[1,1]

In [None]:
# slicing (by label)
fruit_df.loc[1:2,['fruit','color']]

In [None]:
# slicing (by position)
fruit_df.iloc[0:2,0:2]

In [None]:
# Subsetting columns (by column label)
fruit_df[['fruit','yum_score']]

In [None]:
# Subsetting columns (by column position)
fruit_df.iloc[:,[0,2]]

### DataFrame Attributes

In [None]:
# data type of elements in each column
fruit_df.dtypes

In [None]:
# shape (2d)
fruit_df.shape

In [None]:
# row labels
fruit_df.index

In [None]:
# column labels
fruit_df.columns

In [None]:
# all the values output as numpy array (note dtype)
fruit_df.values
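
The "note dtype" hint above is worth unpacking. A NumPy array has a single dtype, so when the columns have different types, the combined array falls back to the generic `object` dtype. The sketch below rebuilds a trimmed version of `fruit_df` to show this; `all_values` and `scores` are illustrative names.

```python
import pandas as pd

fruit_df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'],
                         'yum_score': [5, 5, 5],
                         'in_fridge': [True, False, True]})

# Mixed column types: the combined array falls back to the generic object dtype
all_values = fruit_df.values
print(all_values.dtype)

# A single numeric column keeps a numeric dtype
scores = fruit_df['yum_score'].to_numpy()
print(scores.dtype)
```

Pulling out one column at a time is usually the better route when you need fast numeric arrays.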

### General Info

In [None]:
# general info
fruit_df.info()

## Why Pandas?
NumPy is nice for handling homogeneous data types, but sometimes we need more flexibility as data become more complicated. We might also want a visually pleasing way to view the data.
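
To see what "homogeneous" means in practice, here is a minimal sketch, assuming a single made-up record: NumPy converts everything in a mixed array to one common type (strings here), while a DataFrame keeps a separate dtype for each column. The names `mixed` and `df` are my own.

```python
import numpy as np
import pandas as pd

# One array holds a single dtype, so the numbers are stored as strings
mixed = np.array([101, 'John', 60000])
print(mixed.dtype)

# A DataFrame tracks a dtype per column
df = pd.DataFrame({'ID': [101], 'Name': ['John'], 'Salary': [60000]})
print(df.dtypes)
```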

In [None]:
# Sample data (made up employees)
employee_data = np.array([
    [101, 'John', 'Engineering', 60000, '2018-01-15'],
    [102, 'Jane', 'Engineering', 65000, '2017-05-12'],
    [103, 'Doe', 'HR', 55000, '2019-02-28'],
    [104, 'Alice', 'Marketing', 70000, '2016-11-20'],
    [105, 'Bob', 'HR', 60000, '2019-09-10'],
    [106, 'Eve', 'Marketing', 75000, '2017-04-05']
])
print(employee_data)

# Same data in Pandas dataframe
employee_df = pd.DataFrame(employee_data, columns=['ID', 'Name', 'Department', 'Salary', 'Hire Date'])
# The NumPy array stored every value as a string, so convert Salary back to numbers
employee_df['Salary'] = pd.to_numeric(employee_df['Salary'])
employee_df

In [None]:
# Get the average salary by department

# Find unique departments
unique_departments = np.unique(employee_data[:, 2])

# Calculate average salary for each department
avg_salaries = []
for department in unique_departments:
    department_salaries = employee_data[employee_data[:, 2] == department, 3].astype(float)
    avg_salaries.append(np.mean(department_salaries))

print(unique_departments)
print(avg_salaries)

In [None]:
# Do the same task with Pandas
avg_salaries = employee_df.groupby('Department')['Salary'].mean()
avg_salaries
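
The result of that `groupby` is itself a Series indexed by department, so the Series tools from earlier in the lecture apply to it. The sketch below rebuilds just the two columns it needs from the made-up employee data; `hr_avg` and `above_60k` are illustrative names.

```python
import pandas as pd

employee_df = pd.DataFrame({
    'Department': ['Engineering', 'Engineering', 'HR', 'Marketing', 'HR', 'Marketing'],
    'Salary': [60000, 65000, 55000, 70000, 60000, 75000],
})

avg_salaries = employee_df.groupby('Department')['Salary'].mean()

# The result is a Series indexed by department name
hr_avg = avg_salaries.loc['HR']

# Boolean indexing works on it too
above_60k = avg_salaries[avg_salaries > 60000]
print(hr_avg)
print(above_60k)
```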

## Activity 

Consider the following data:

| Pet Name | Species | Age | Adoption Fee |
|----------|---------|-----|--------------|
| Whiskers | Cat     | 3   | 25.00        |
| Bubbles  | Fish    | 1   | 3.00         |
| Rover    | Dog     | 2   | 75.00        |
| Hopper   | Bunny   | 2   | 15.00        |


Create a Pandas dataframe containing the data above. 

Select and display only the `Pet Name` and `Species` columns.

Select and display only the first two rows of the dataframe.

Select and display the pets that have an adoption fee less than $20. 