# <font color="green">Selecting Rows and Columns</font>

----------------------------------



## Columns: Selecting and Working with Columns from a DataFrame

### Selecting Columns
- Select a single column with `df['column']`, or several columns by passing a list of names: `df[['column1', 'column2']]`.
- Selecting columns is a fundamental skill because it is often the first step in many data manipulation tasks.

### Working with Columns
- Once columns are selected, they can be used for various operations such as computation, aggregation, and combination with other data.
- Tasks such as renaming columns, handling missing values, or changing the data type of columns are essential for further analysis.
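The three tasks named above (renaming, handling missing values, changing a dtype) can be sketched as follows. This is a minimal illustration on made-up toy data, not the penguin dataset; the column names and the choice to fill with the mean are assumptions.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for this illustration.
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap"],
    "Body Mass (g)": [3750.0, None, 3800.0],
})

# Rename columns to short, consistent names.
toy = toy.rename(columns={"Species": "species", "Body Mass (g)": "body_mass_g"})

# Fill the missing value with the column mean (one possible strategy among many).
toy["body_mass_g"] = toy["body_mass_g"].fillna(toy["body_mass_g"].mean())

# Cast the now-complete column from float to int.
toy["body_mass_g"] = toy["body_mass_g"].astype(int)

print(toy.dtypes)
```

Each step selects the column first and then assigns the result back, which is why column selection tends to come first in a cleaning workflow.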

## Rows: Understanding the Difference Between `loc` and `iloc` Methods, and Selecting Rows from a DataFrame

### Difference Between `loc` and `iloc`
- **`loc`**: Used to access a group of rows and columns by labels or a boolean array. It is label-based, requiring specification of the row and column labels.
- **`iloc`**: Used for integer position-based access. Rows and columns are selected by their 0-based positions, regardless of their labels.

### Selecting Rows
- Selecting rows is a process akin to selecting columns, typically accomplished using `loc` or `iloc`.
- This is especially useful for filtering data based on some condition or when dealing with a specific range of indices.
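As a small sketch of the difference, the toy DataFrame below (hypothetical data with string row labels) is accessed once by label, once by position, and once with a condition.

```python
import pandas as pd

# Hypothetical toy data with string row labels "a", "b", "c".
toy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Chinstrap"], "mass_g": [3750, 5000, 3800]},
    index=["a", "b", "c"],
)

by_label = toy.loc["b", "mass_g"]   # row labelled "b", column labelled "mass_g"
by_position = toy.iloc[1, 1]        # second row, second column (0-based positions)

# Filtering on a condition: keep only the rows where mass_g exceeds 3775.
heavy = toy.loc[toy["mass_g"] > 3775]

print(by_label, by_position)
print(heavy)
```

Both lookups return the same cell here; they only differ in how the row and column are addressed.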

Understanding and mastering the manipulation of rows and columns in Pandas is pivotal for effective data cleaning, preparation, and analysis. These skills are integral to performing a broad spectrum of data analysis tasks with efficiency and precision.


------------------

### Key Concepts
Being able to select specific parts of a dataset is an essential part of working with data. For example, filling in missing data in one or more columns requires selecting those columns first. You should therefore be very comfortable with the syntax for selecting rows and columns.

| Command                       | Description                                           |
|-------------------------------|-------------------------------------------------------|
| df[col]                       | select one column as a Series                         |
| df[[col]]                     | select one column as a DataFrame                      |
| df[[col1, col2, ... ]]        | select 2+ columns as a DataFrame                      |
| df['column_name'] = new_values| assign new values to the column                        |
| df.drop()                     | drop specified rows or columns                        |
| df['column'].astype()         | cast a pandas column to a specified dtype             |
| df.loc[row]                   | select one row as a Series by index label             |
| df.loc[[row1, row2]]          | select 1+ rows as a DataFrame by index label          |
| df.loc[[row], [col]]          | select rows and columns as a DataFrame by label       |
| df.iloc[a:b, c:d]             | select rows/columns by integer position (end excluded)|
| df.set_index()                | set selected column as index                           |
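To make the "Series" versus "DataFrame" distinction in the table concrete, the sketch below prints the type and shape each selection returns. The toy data is hypothetical and uses the default integer index.

```python
import pandas as pd

# Hypothetical toy data with the default integer index (0, 1).
toy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "mass_g": [3750, 5000], "sex": ["MALE", "FEMALE"]}
)

one_col_series = toy["sex"]     # single brackets: Series
one_col_frame = toy[["sex"]]    # list of one name: DataFrame
one_row_series = toy.loc[0]     # single label: Series
one_row_frame = toy.loc[[0]]    # list of one label: DataFrame
block = toy.iloc[0:2, 0:2]      # position slices: DataFrame

for name, obj in [
    ("toy['sex']", one_col_series),
    ("toy[['sex']]", one_col_frame),
    ("toy.loc[0]", one_row_series),
    ("toy.loc[[0]]", one_row_frame),
    ("toy.iloc[0:2, 0:2]", block),
]:
    print(name, type(obj).__name__, obj.shape)
```

The rule of thumb: a scalar selector collapses that dimension, while a list or slice keeps it.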


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('./data/penguins_simple.csv', sep=";")

In [None]:
# look at the dataframe

df

In [None]:
# check out the columns
df.columns

In [None]:
#  first n rows (default 5)
df.head()

In [None]:
#  last n rows (default 5)
df.tail()

## 1. Columns

### 1.1 Renaming the columns

In [None]:
df.columns =['species', 'culmen_length_mm','culmen_depth_mm', 'flipper_length_mm', 'body_mass_gg', 'sex']

In [None]:
df

### 1.2 Selecting a single column

In [None]:
# using single square brackets, we select a single column as a series
df['sex']

This will extract the column as a pd.Series.

In [None]:
type(df['sex']) # as we learned in the pandas encounter, a single column is a pandas Series

In [None]:
# using double square brackets, we select a single column as a dataframe
df[['sex']]

In [None]:
type(df[['sex']])

In [None]:
# Why? To select more than one column, we pass a list of column labels, e.g. ['sex', '...'], inside df[...],
# which gives something like df[['sex', '...']].
# pandas then has to return a dataframe, because multiple columns do not fit into one Series :)
# A list with just one column follows the same rule, so we get a dataframe containing only that column.
df[     ['sex']      ]

### 1.3 Selecting Multiple Columns

In [None]:
# when selecting multiple columns, we HAVE to use double square brackets
# and we get a dataframe back
df[['species','culmen_length_mm','sex']]

### 1.4 Changing column types

In [None]:
df.dtypes # each column of a dataframe can have its own data type, hence the plural: dtypes


In [None]:
# we sometimes need to change the data type of a column; this is usually part of data cleaning

df['flipper_length_mm'] = df['flipper_length_mm'].astype(int)

In [None]:
# Let's check the datatypes again
df.dtypes
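One caveat when casting to `int`: the conversion fails if the column contains missing values. The sketch below uses a hypothetical Series with one gap to show the failure and one possible workaround, pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical flipper lengths with one missing value.
lengths = pd.Series([181.0, None, 195.0])

cast_failed = False
try:
    lengths.astype(int)  # NaN has no integer representation
except ValueError as err:
    cast_failed = True
    print("astype(int) failed:", err)

# The nullable integer dtype keeps the gap as <NA> instead of failing.
nullable = lengths.astype("Int64")
print(nullable)
```

Alternatively, the missing values could be filled or dropped before casting; which option fits depends on the analysis.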

###  1.5 Creating columns

In [None]:
# Adding a new column
# For example, let's add a 'year_collected' column with a constant value
df['year_collected'] = 2024

In [None]:
df

#### Let's create another column for body mass, but in kilograms

> SIDEBAR: How to round numbers in python:
`round(<number_to_round>, <number_of_decimal_points>)`

In [None]:
# Convert the body mass column ('body_mass_gg', as named in 1.1) from grams to kilograms, round to 2 decimal places, and create a new column

df['body_mass_kg'] = (df['body_mass_gg'] / 1000.0).round(2)

In [None]:
# Now, df includes a new column 'body_mass_kg' with the body mass in kilograms, rounded to 2 decimal places
df
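The rounding sidebar describes Python's built-in `round()`, while the cell above calls the `.round()` method of a pandas Series. A minimal sketch of both, on hypothetical values:

```python
import pandas as pd

# Built-in round() works on a single number.
single = round(3.14159, 2)
print(single)  # 3.14

# Series.round() rounds every element of a column at once.
masses_g = pd.Series([3750, 3800, 3250])  # hypothetical body masses in grams
masses_kg = (masses_g / 1000).round(2)
print(masses_kg)
```

For whole columns the Series method is the natural choice, because it avoids looping over the values.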

### 1.6 Dropping columns

In [None]:
# dropping a single column

df.drop('year_collected', axis='columns')
#df.drop('year_collected', axis=1)

In [None]:
# note that the dataframe did not change!
df

Uh-oh! Dropping returned a (changed) copy of the dataframe, but didn't change the original!

To make the changes stick, you can:
* assign the result to another dataframe (or the same if you want to overwrite it)
* use the `inplace=True` parameter

In [None]:
# assign the result to another dataframe
df_new = df.drop('year_collected', axis=1) # axis=1 is equivalent to axis='columns'

In [None]:
df_new

In [None]:
# to drop the column from the original dataframe, we use the inplace=True parameter
df.drop('year_collected', axis='columns', inplace=True)

In [None]:
df
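`drop()` also accepts a list of labels and can remove rows as well as columns. The sketch below uses the `columns=` and `index=` keywords on hypothetical toy data; both return a new DataFrame and leave the original untouched.

```python
import pandas as pd

# Hypothetical toy data; 'note' is a made-up column for this illustration.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "mass_g": [3750, 5000, 3800],
    "year_collected": [2024, 2024, 2024],
    "note": ["", "", ""],
})

# Drop several columns at once (same effect as passing the list with axis='columns').
slim = toy.drop(columns=["year_collected", "note"])

# Drop rows by their index labels.
fewer_rows = toy.drop(index=[0, 2])

print(slim.columns.tolist())
print(fewer_rows)
```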

## 2. Selecting rows (and columns): `.loc[]` and `.iloc[]` methods

A brief slicing recap:

In [None]:
a = [1,2,3,4,5,6]

In [None]:
a

Reminder: slicing syntax is `a[start:end:step]`
* If not specified, `start` is the beginning of the list.
* If not specified, `end` is the end of the list.
* You can use minus to count from the back, e.g. the second element from the back is `a[-2]`
* If not specified, `step` is 1.

In [None]:
# select first 4 numbers
a[:4]

In [None]:
a[1:4]

In [None]:
# what does this do?
a[::3]
# it gives us every third element in the list
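A few more slices on the same list, covering negative indices and a negative step:

```python
a = [1, 2, 3, 4, 5, 6]

second_from_back = a[-2]   # 5
last_three = a[-3:]        # [4, 5, 6]
reversed_copy = a[::-1]    # [6, 5, 4, 3, 2, 1]
every_third = a[::3]       # [1, 4]

print(second_from_back, last_three, reversed_copy, every_third)
```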

## Selecting rows with .loc[row_index]

### Here's a quick guide on how to use `loc[]`:

`loc[]` is a versatile selection method in Pandas that allows you to specify rows and columns to access based on their labels.

- **Select a single row:** `df.loc['index_label']`   
    Accesses the row with the specified label.  


- **Select multiple rows:** `df.loc[['label1', 'label2']]`  
    Retrieves multiple rows with the given labels.  


- **Select rows by range of labels:** `df.loc['label1':'label3']`   
    Selects all rows from the first label through the last one, <u>including</u> both endpoints.  


- **Conditional selection:** `df.loc[df['column'] > value]`  
    Filters rows <u>based on a condition</u> applied to column values.  


- **Select specific rows and columns:** `df.loc[['row1', 'row2'], ['column1', 'column2']]`  
   Selects particular rows and columns by specifying their labels.  

#### Important Notes:
- `loc[]` operates on the DataFrame's index labels and column names. It is particularly well-suited for DataFrames where these labels are meaningful.
- It's often used when the operations are based on the data's content rather than its position.
- When using `loc[]` to select rows by a range of labels, the end label is <u>included</u> in the output, which is different from typical Python slicing.

Mastering `loc[]` will enhance your data manipulation capabilities, allowing you to efficiently access and modify your data based on its labels.
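The patterns from the guide can be tried on a small DataFrame with string labels. The data and the labels `p1` to `p4` are hypothetical and only serve this sketch.

```python
import pandas as pd

# Hypothetical toy data with unique string row labels.
toy = pd.DataFrame(
    {"species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
     "mass_g": [3750, 3800, 5000, 3700]},
    index=["p1", "p2", "p3", "p4"],
)

single = toy.loc["p1"]                           # one row, returned as a Series
several = toy.loc[["p1", "p3"]]                  # listed rows, returned as a DataFrame
label_range = toy.loc["p1":"p3"]                 # label slice: p3 is included
filtered = toy.loc[toy["mass_g"] > 3775]         # rows where the condition is True
cell_block = toy.loc[["p1", "p2"], ["mass_g"]]   # chosen rows and columns

print(label_range)
print(filtered)
```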




In [None]:
df

In [None]:
# Example 1: Select the first row and all columns
df.loc[0, :]

In [None]:
# Example 2: Select first two rows and all columns
df.loc[0:1, :]

In [None]:
df

In [None]:
# Example 3: Select rows where the condition is True
df.loc[df['culmen_length_mm'] > 40] 

In [None]:
# Example 4: Select all rows where the species is "Adelie" and assigning the subset of the dataframe to a new variable:

adelie_penguins = df.loc[df['species'] == 'Adelie']
adelie_penguins

### Closer Look: How does loc[] work?

**Notice that `df.loc[0:1]` includes both row 0 and row 1.  
`loc` uses the row labels, so the end label is part of the selection.**

In [None]:
# recap Example 2: Select first two rows and all columns
df.loc[0:1]

**Let's make some changes to understand it better**

In [None]:
# changing the index column of the dataframe
df.set_index('species', inplace=True) 

# It is also possible to assign a column as index, when we are reading the data
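Setting the index while reading can be done with the `index_col` parameter of `pd.read_csv`. The sketch below reads from an in-memory string instead of a file so that it runs on its own; the CSV content is made up.

```python
import io
import pandas as pd

# Hypothetical CSV content, using ';' as separator like the penguin file.
csv_text = "species;mass_g\nAdelie;3750\nGentoo;5000\n"

# index_col names the column that should become the row index.
toy = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="species")

print(toy)
```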


In [None]:
df

In [None]:
# Example 1: Try to select the second row with its old label

# This now raises a KeyError, because the index labels are species names and the label 1 no longer exists.
df.loc[1, :] 

In [None]:
df.loc['Adelie', :] # the new index labels are the species names. loc uses the labels.

In [None]:
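# Example 2: Select all rows from the first 'Adelie' label through the last 'Chinstrap' label
# (slicing between repeated labels like this relies on the index being sorted)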
df.loc['Adelie':'Chinstrap', :]
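When index labels repeat, a label slice depends on the row order: with a sorted index it runs from the first `'Adelie'` to the last `'Chinstrap'`, while on an unsorted index with repeated labels pandas may raise a `KeyError`. A boolean mask built with `Index.isin` is one order-independent alternative. The sketch uses hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data with repeated species labels, sorted alphabetically.
toy = pd.DataFrame(
    {"mass_g": [3750, 3800, 3700, 5000]},
    index=["Adelie", "Adelie", "Chinstrap", "Gentoo"],
)

# Sorted index: the slice covers every row from 'Adelie' through 'Chinstrap'.
sliced = toy.loc["Adelie":"Chinstrap"]

# Same rows in a shuffled order; a mask on the index does not care about sorting.
shuffled = toy.iloc[[3, 0, 2, 1]]
masked = shuffled.loc[shuffled.index.isin(["Adelie", "Chinstrap"])]

print(sliced)
print(masked)
```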

## Selecting rows with .iloc[integer_position]

`iloc[]` is the pandas indexer for integer-location based indexing: it selects rows and columns by their integer positions rather than by their labels.

## iloc[] Usage Guide

`iloc[]` provides flexibility in data selection within pandas DataFrames through integer indexing. Below are some common ways to utilize `iloc[]`:
 
- **Select a Single Row**: `df.iloc[5]`  
  This accesses the sixth row in the DataFrame since indexing starts at 0.   


- **Select Multiple Rows**: `df.iloc[5:10]`  
  Retrieves rows 6 to 10, including the start index but <u>excluding</u> the end index.  


- **Select Rows and Specific Columns**: `df.iloc[5:10, 0:2]`  
  Selects rows 6 to 10 and the first two columns.  


- **Select Specific Rows and All Columns**: `df.iloc[[1, 3, 7], :]`  
  Accesses rows 2, 4, and 8 across all columns.  
 
    
## Additional Tips

- Remember that `iloc[]` uses zero-based indexing, similar to indexing in native Python lists or numpy arrays.

- The end index in a range is excluded, aligning with standard Python slicing notation.

- With `iloc[]`, negative indices count from the end (`df.iloc[-1]` is the last row), and a negative step such as `df.iloc[::-1]` returns the rows in reverse order.
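The tips above can be checked on a small example. The toy DataFrame is hypothetical and uses the default integer index.

```python
import pandas as pd

# Hypothetical toy data with the default integer index (0 to 3).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "mass_g": [3750, 3800, 5000, 3700],
})

last_row = toy.iloc[-1]         # last row, as a Series
last_two = toy.iloc[-2:]        # last two rows
reversed_rows = toy.iloc[::-1]  # all rows in reverse order
last_column = toy.iloc[:, -1]   # last column, as a Series

print(last_two)
print(reversed_rows.index.tolist())
```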


In [None]:
df.iloc[5] #(selects the row at position 5)

In [None]:
df.iloc[5:10] #(selects rows at positions 5 to 9)

In [None]:
df.iloc[5:10, 0:2]  #(selects rows at positions 5 to 9 and columns at positions 0 to 1)

In [None]:
df.iloc[[1, 3, 7], :] #(selects rows at positions 1, 3, and 7, along with all columns)

# Recap: .loc vs .iloc in Pandas

Pandas offers two powerful methods for data selection within DataFrames, `.loc` and `.iloc`, each catering to different selection criteria:

## .loc Method (Label-based Selection)

- **Label-based Selection**: Utilizes row and column **labels** for data access.
- **Syntax**: `df.loc[row_label, column_label]`
  - You specify the rows and columns by using their labels.
  - This method is ideal when you know the exact labels of the rows and columns you're interested in.

## .iloc Method (Integer Position-based Selection)

- **Integer Position-based Selection**: Relies on **integer position values** (0-based) for selecting rows and columns.
- **Syntax**: `df.iloc[row_position, column_position]`
  - Rows and columns are specified by their integer index positions.
  - This approach is useful when you want to access data based on its position in the DataFrame, similar to indexing in Python lists or arrays.
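With the default index, labels and positions happen to coincide, which hides the difference between the two. After filtering they drift apart, as this sketch on hypothetical toy data shows. The `reset_index(drop=True)` call at the end is an optional extra, not something the notebook above uses.

```python
import pandas as pd

# Hypothetical toy data with the default integer index (0, 1, 2).
toy = pd.DataFrame({"species": ["Adelie", "Gentoo", "Gentoo"],
                    "mass_g": [3750, 5000, 5200]})

# Filtering keeps the original labels: the result is labelled 1 and 2.
gentoo = toy.loc[toy["species"] == "Gentoo"]

first_by_position = gentoo.iloc[0]  # first remaining row, whatever its label
row_by_label = gentoo.loc[1]        # the row whose label is 1 (the same row here)

# gentoo.loc[0] would raise a KeyError, because label 0 was filtered out.

# Optional: renumber the rows so labels and positions match again.
renumbered = gentoo.reset_index(drop=True)

print(gentoo.index.tolist(), renumbered.index.tolist())
```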
