Python and Science - https://github.com/egalli64/pysci

Kaggle Courses - Pandas - https://www.kaggle.com/learn/pandas

Indexing, Selecting & Assigning - https://www.kaggle.com/code/residentmario/indexing-selecting-assigning

In [159]:
# Setup /1: only pandas is used here
import pandas as pd

In [160]:
# Setup /2: generate the data frame used for examples

reviews = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US', 'Italy', 'Belgium', 'France'], 
    'apples': [35, 41, 34, 18, 27, 32], 
    'bananas': [21, 34, 54, 21, None, 43],
    'points': [85, 88, 87, 92, 81, 95]
})

Native accessors

In [None]:
# The full data frame
reviews

In [None]:
# Accessing a data frame column as an attribute named after the column
reviews.country


In [None]:
# Accessing a data frame column as an associative array
reviews['country']

In [None]:
# A data frame column is a Series, and its elements can be accessed by index label in the same way
reviews['country'][0]
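
The two column accessors above are not fully interchangeable. The sketch below uses a hypothetical frame named demo (not part of the original notebook) to show the cases where only the bracket form reaches the column: a name that clashes with a DataFrame method, and a name containing a space.

```python
import pandas as pd

# 'demo' and its column names are assumptions made for this illustration only
demo = pd.DataFrame({
    'country': ['Italy', 'US'],
    'count': [3, 5],
    'unit price': [1.5, 2.0],
})

# Attribute access works for a plain column name
by_attribute = demo.country

# 'count' is also a DataFrame method, so the attribute resolves to the method
clashes = callable(demo.count)

# Bracket access always reaches the column, whatever its name
count_column = demo['count']
price_column = demo['unit price']
```

For this reason the bracket form is usually the safer default when column names are not under your control.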

Indexing by iloc - index-based selection

When a slice is used, the interval is _right-open_: the start position is included, the end position is excluded

In [None]:
# Get the first row
reviews.iloc[0]

In [None]:
# Get the first column (actually, all rows - by slice - for column 0)
reviews.iloc[:, 0]

In [None]:
# Get first column, slicing rows 1 and 2
reviews.iloc[1:3, 0]

In [None]:
# Get first column, picking rows 0 and 2 - by list
reviews.iloc[[0, 2], 0]

In [None]:
# Get the full last two rows
reviews.iloc[-2:]

Indexing by loc - Label-based selection

When a slice is used, the interval is _closed_: both the start label and the end label are included

In [None]:
# Get value in row labeled 0, column labeled 'country'
reviews.loc[0, 'country']

In [None]:
# Labels are usually more readable than indices
reviews.loc[:, ['country', 'points']]
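
The difference between the two slicing rules is easiest to see on the same bounds. This is a minimal sketch on a hypothetical frame named demo with the default integer index, so positions and labels happen to coincide.

```python
import pandas as pd

# 'demo' is an assumption made for this illustration only
demo = pd.DataFrame({'points': [85, 88, 87, 92]})  # default index 0..3

# iloc: positions 1 and 2, position 3 is excluded
by_position = demo.iloc[1:3]

# loc: labels 1, 2 and 3, label 3 is included
by_label = demo.loc[1:3]

print(len(by_position))  # 2
print(len(by_label))     # 3
```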

Manipulating the index

In [None]:
# set_index() returns a new data frame indexed by points; reviews itself is not changed
reviews.set_index("points")

In [173]:
# change the index in-place
reviews.set_index("points", inplace=True)

In [None]:
# using the new index (points) to get the rows from label 85 to label 87
# (the interval is closed and, since the index is not sorted, it follows row order: the 88 row is included too)
reviews.loc[85:87,:]
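
Label slicing on an unsorted index deserves some care. The sketch below, on a hypothetical frame named demo, shows that the slice runs from the row holding the first label to the row holding the second one, that sorting the index first gives a range by value, and that on an unsorted index a bound missing from the index raises a KeyError.

```python
import pandas as pd

# 'demo' is an assumption made for this illustration only
demo = pd.DataFrame(
    {'country': ['Italy', 'Portugal', 'US', 'Italy']},
    index=[85, 88, 87, 92],
)

# Unsorted index: rows from label 85 down to label 87, in row order
unsorted_slice = demo.loc[85:87]

# Sorted index: the slice now selects by value
sorted_slice = demo.sort_index().loc[85:87]

# Unsorted index and a bound (90) that is not a label: pandas raises KeyError
try:
    demo.loc[85:90]
    missing_bound_ok = True
except KeyError:
    missing_bound_ok = False

# On a sorted index the same bounds are accepted
sorted_missing_bound = demo.sort_index().loc[85:90]
```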

In [None]:
# reset the index to default

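# move the current index (points) back to a regular column;
# after this call the data frame already has a default 0..n-1 index
reviews.reset_index(inplace=True)

# the explicit way, step by step: add a new column to the data frame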
reviews['MyIndex'] = range(len(reviews))

# set the new column as index
reviews.set_index('MyIndex', inplace=True)

# get rid of the temporary name
reviews.index.name = None

reviews
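
As a shorter alternative to building the index column by hand, reset_index() on its own should be enough. This is a minimal sketch on a hypothetical frame named demo; the drop=True variant discards the old index instead of keeping it as a column.

```python
import pandas as pd

# 'demo' is an assumption made for this illustration only
demo = pd.DataFrame(
    {'country': ['Italy', 'US'], 'points': [85, 87]}
).set_index('points')

# points goes back to being a column, the index becomes 0..n-1
restored = demo.reset_index()

# the old index is discarded instead of being kept as a column
dropped = demo.reset_index(drop=True)
```

Both calls return a new data frame, so demo keeps points as its index unless inplace=True is passed.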

Conditional selection

In [None]:
# Check for each row whether its country is Italy; the result is a new Series of boolean values
reviews.country == 'Italy'

In [None]:
# using the resulting series to get only the rows of interest
reviews.loc[reviews.country == 'Italy']

In [None]:
# use a single ampersand as logical AND
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

In [None]:
# use a single pipe as logical OR
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
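
The single-character operators are required because the Python keywords and / or ask for the truth value of a whole Series, which pandas refuses to give. The parentheses matter as well, since & and | bind more tightly than == and >=. A minimal sketch on a hypothetical frame named demo, which also shows ~ as logical NOT:

```python
import pandas as pd

# 'demo' is an assumption made for this illustration only
demo = pd.DataFrame({
    'country': ['Italy', 'US', 'Italy'],
    'points': [85, 87, 92],
})

is_italy = demo.country == 'Italy'
is_high = demo.points >= 90

# Element-wise AND and NOT on boolean Series
both = demo.loc[is_italy & is_high]
not_italy = demo.loc[~is_italy]

# The keyword 'and' needs a single True/False, so pandas raises ValueError
try:
    demo.loc[is_italy and is_high]
    keyword_and_works = True
except ValueError:
    keyword_and_works = False
```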

In [None]:
# the isin() method can be more readable than a chain of OR conditions
reviews.loc[reviews.country.isin(['Italy', 'France'])]

In [None]:
# notnull() filter to keep the rows with non-missing values
reviews.loc[reviews.bananas.notnull()]

In [None]:
# isnull() filter to keep the rows with missing values
reviews.loc[reviews.bananas.isnull()]

Assigning data

In [None]:
# adding a new column with a constant in it (same for each row)
reviews['critic'] = 'everyone'
reviews

In [None]:
# adding a new column filled from an iterable, one value per row
reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews
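
Conditional selection and assignment can also be combined: loc accepts a boolean Series for the rows and a column label, so only the matching rows are updated. This is a sketch on a hypothetical frame named demo; the 'expert' value is made up for the example.

```python
import pandas as pd

# 'demo' and the 'expert' value are assumptions made for this illustration only
demo = pd.DataFrame({
    'country': ['Italy', 'US', 'France'],
    'points': [85, 87, 95],
})

# Start from a constant column
demo['critic'] = 'everyone'

# Overwrite the value only where the condition holds
demo.loc[demo.points >= 90, 'critic'] = 'expert'
```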