
# Pandas: Cheat sheet

In [None]:
import pandas as pd
import seaborn as sns

# Load Titanic dataset
df = sns.load_dataset('titanic')  # seaborn comes with many datasets for purposes of demonstration
df.head()  # Displays the first 5 rows of the DataFrame

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
# Indexing: Accessing a column
df['age']  # Accesses the 'age' column --> Series
df[['age', 'sex']]  # Accesses the 'age' and 'sex' columns --> DataFrame
df.age  # Accessing the 'age' column using the attribute access "shortcut"

# Sorting: Sort by a column
df.sort_values(by='fare')  # Sorts the DataFrame by the 'fare' column

# Masking: Filter rows based on a condition
df[df['sex'] == 'male']  # Filters rows where the 'sex' column is 'male'
df[df.sex == 'male']    # Same as above, with the shortcut

# value_counts: Count unique values in a column
df.value_counts('sex')  # Counts occurrences of each gender

# info: Summary of the DataFrame
# df.info()  # Provides summary information about the DataFrame

# describe: Statistical summary of numeric columns
df.describe()  # Provides a statistical summary for numeric columns

# head: Top rows of the DataFrame
df.head()

# tail: Bottom rows of the DataFrame
df.tail()  # Displays the last 5 rows of the DataFrame

# drop: Remove columns or rows
df.drop(columns=['deck'], inplace=True)  # Removes the 'deck' column

# rename: Rename columns
df.rename(columns={'pclass': 'passenger_class'}, inplace=True)  # Renames 'pclass' to 'passenger_class'

# fillna: Fill NA/NaN values
df['age'] = df['age'].fillna(df['age'].mean())  # Fills NA values in 'age' with the mean age (assigned back, since in-place fillna on a selected column is unreliable in recent pandas)

# isna: Detect missing values
df.isna()  # Returns a boolean DataFrame indicating missing values

# notna: Detect existing (non-missing) values
df.notna()  # Returns a boolean DataFrame indicating non-missing values

# groupby: Group DataFrame using a column
df.groupby('sex').mean(numeric_only=True)  # Groups data by 'sex' and calculates the mean for each numeric column

# aggregations: Apply multiple aggregation operations
df.groupby('sex').agg({'age': 'mean', 'fare': 'sum'})  # Groups by 'sex', then calculates mean age and sum of fare

pass

In [None]:
# Adding columns in a mutating fashion
df['family_on_board'] = df.sibsp + df.parch  # Adds a new column 'family_on_board' as the sum of 'sibsp' and 'parch'

# Adding columns using the assign method (non-mutating)
df = df.assign(total_fare_per_person=df.fare / (df.family_on_board + 1))  # Adds 'total_fare_per_person' column

# Give the first five people names and set as index
first_five_names = ['John', 'Alice', 'Maria', 'James', 'Emma']
df_first_five = df.head().assign(name=first_five_names).set_index('name')  # Assigns names and sets them as index

# Using .loc to access rows by label (name)
df_loc_by_name = df_first_five.loc['John']  # Accesses the row labeled 'John'

# Using .loc for slicing rows (by label)
df_sliced_by_label = df_first_five.loc['John':'Maria']  # Selects rows from 'John' to 'Maria'

# Using .loc to access specific row and column by label
age_of_john = df_first_five.loc['John', 'age']  # Accesses 'age' for the row labeled 'John'

# Using .loc for accessing rows and columns together by labels
df_rows_cols_labels = df_first_five.loc['John':'Maria', ['age', 'fare']]  # Selects 'age' and 'fare' for 'John' to 'Maria'

# Reset index
df_reset = df_first_five.reset_index()  # Resets the index, turning 'name' back into a column

pass

# Pandas' `inplace=True` and the Principle of "Well-Behaved" Functions



When working with pandas, you'll often encounter the `inplace` parameter in various methods. This parameter is directly tied to the concept of "well-behaved" functions in programming. Let's explore what this means in the context of pandas.

## The Principle of "Well-Behaved" Functions:

The principle of "well-behaved" functions in programming categorizes functions into two main types:

1. **PURE FUNCTION:** This type of function returns a value and does not alter the state of the world (i.e., it does not have side effects). It's predictable and doesn't rely on, nor does it modify, the state of external variables. Examples include:
   - `len()`: Returns the length of an object.
   - `type(object)`: Returns the type of an object.
   - `str.upper()`: Returns a new string in uppercase.
   - `str.replace(substring, replacement)`: Returns a new string with replaced values.

2. **MUTATING FUNCTION:** This function changes or affects the world (i.e., it has side effects) and typically returns `None`. These functions alter the state of an object or the environment. Examples include:
   - `print`: Prints to the console (a side effect) and returns `None`.
   - `list.append(item)`: Adds an item to a list, modifying it.
   - `list.remove(item)`: Removes an item from a list, altering it.
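
The two categories can be seen side by side in plain Python. This is a minimal sketch; the variable names are illustrative only.

```python
# Pure: str.upper() returns a new value and leaves its input untouched.
name = "titanic"
shouted = name.upper()

# Mutating: list.append() changes the list in place and returns None.
passengers = ["John", "Alice"]
returned = passengers.append("Maria")

print(name, shouted)  # the original string is unchanged
print(passengers)     # the list now has three items
print(returned)       # None
```

Assigning the result of a mutating call (as in `returned` above) is a common beginner mistake, because the name ends up bound to `None`.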

## The `inplace` Parameter in Pandas:

In pandas, many methods accept an `inplace` parameter, which defaults to `False`. This parameter decides whether the operation should mutate the original DataFrame (`inplace=True`) or return a new modified DataFrame (`inplace=False`).

- **`inplace=False` (Default Behavior):**
  - Adheres to the **PURE FUNCTION** principle.
  - Returns a new DataFrame and leaves the original DataFrame unchanged.
  - Example:
  ```python
  # returns a new sorted DataFrame, original `df` remains unsorted.
  df2 = df.sort_values(by='fare') # df2 is sorted, df is not
  ```

- **`inplace=True`:**
  - Follows the **MUTATING FUNCTION** principle.
  - Directly modifies the original DataFrame and returns `None`.
  - Example:
  ```python
  # removes the 'deck' column from `df` and does not return a new DataFrame.
  df.drop(columns=['deck'], inplace=True) # returns None
  ```
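
Both behaviors can be checked on a tiny hand-built DataFrame. This is a sketch only: the three rows are a hypothetical stand-in for the Titanic data, not values taken from it.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data
df = pd.DataFrame({"fare": [71.28, 7.25, 53.10], "deck": ["C", None, "C"]})

# Default (inplace=False): a new DataFrame comes back, df keeps its row order
df_sorted = df.sort_values(by="fare")

# inplace=True: df itself changes and the call returns None
returned = df.drop(columns=["deck"], inplace=True)

print(df_sorted["fare"].tolist())  # sorted copy
print(df["fare"].tolist())         # original order is preserved
print(df.columns.tolist())         # 'deck' is gone from df
print(returned)                    # None
```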

## Best Practices:

- **Predictability:** Using `inplace=False` (the default) is more predictable as it doesn’t alter the original DataFrame. It’s generally safer, especially for beginners or in exploratory data analysis.
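- **Memory Considerations:** `inplace=True` is often assumed to be more memory efficient because it doesn't hand back a new object, but many pandas operations still build a copy internally, so the saving is not guaranteed. It can also lead to unintended consequences if not used cautiously.
- **Chainability:** Methods called with `inplace=False` return a DataFrame, so they are chainable, i.e., you can chain multiple operations in a single expression. This is not possible with `inplace=True`, because the call returns `None`.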
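
The chainability point can be illustrated with a short sketch. The DataFrame below is a hypothetical stand-in, and `chain_error` is just an illustrative name for the captured exception.

```python
import pandas as pd

df = pd.DataFrame({"fare": [71.28, 7.25, 53.10], "pclass": [1, 3, 1]})

# Works: each call returns a DataFrame for the next call to use
cheapest = df.sort_values(by="fare").head(1)

# Fails: sort_values(inplace=True) returns None, and None has no .head()
try:
    df.copy().sort_values(by="fare", inplace=True).head(1)
    chain_error = None
except AttributeError as exc:
    chain_error = exc

print(cheapest)
print(type(chain_error).__name__)  # AttributeError
```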

### Chaining Example:



In [None]:
# Load Titanic dataset
df = sns.load_dataset('titanic')

# Chaining operations with annotations
result = (df[['survived', 'pclass', 'sex', 'fare']]  # Select only relevant columns
          .groupby(['pclass', 'sex'])  # Group data by passenger class and sex
          .agg({'survived': 'mean', 'fare': 'mean'})  # Aggregate: calculate mean survival rate and fare
          .rename(columns={'survived': 'average_survival_rate', 'fare': 'average_fare'})  # Rename columns for clarity
         )

# df is unchanged
result

pclass,sex,average_survival_rate,average_fare
1,female,0.968085,106.125798
1,male,0.368852,67.226127
2,female,0.921053,21.970121
2,male,0.157407,19.741782
3,female,0.5,16.11881
3,male,0.135447,12.661633


# Chaining in Python: An Expressive Approach to Coding


> Chaining, sometimes referred to as **"flow style"** in Python, is a powerful coding technique that involves calling multiple methods sequentially in a single, coherent, highly readable expression.

Chaining is about calling one method after another, where each method returns an object that the next method can act upon. This technique is particularly popular in data manipulation libraries like pandas but is applicable in many other contexts as well.

```python
import pandas as pd

# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Chaining operations
result = (df
          .assign(C=lambda x: x['A'] + x['B'])  # Add a new column 'C' as the sum of 'A' and 'B'
          .query('C > 5')                     # Filter rows where 'C' is greater than 5
          .sort_values(by='C')                # Sort the DataFrame by column 'C'
         )

result
```
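
To see what each link in the chain contributes, it can help to compare the chain with a step-by-step version. The sketch below assumes the same toy DataFrame; the intermediate names (`with_c`, `filtered`, `stepwise`) are hypothetical and exist only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Chained version
chained = (df
           .assign(C=lambda x: x['A'] + x['B'])
           .query('C > 5')
           .sort_values(by='C')
          )

# Step-by-step version with intermediate variables
with_c = df.assign(C=df['A'] + df['B'])   # add column 'C'
filtered = with_c[with_c['C'] > 5]        # keep rows where 'C' > 5
stepwise = filtered.sort_values(by='C')   # sort by 'C'

print(chained.equals(stepwise))  # True
```

The results are identical; the chain simply avoids naming intermediate DataFrames that are never used again.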

## Making Chaining Possible: Parentheses and Alignment
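A long chain has to be broken across several lines to stay readable, and Python's syntax offers a neat solution: **wrapping the entire chain in parentheses**. Inside parentheses, Python ignores line breaks, so each method call can sit on its own line, aligned on its leading dot, with room for a trailing comment.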
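
As a rough comparison, here is the same chain written three ways on a toy DataFrame (the data and variable names are illustrative assumptions). All three produce the same result; only the layout differs.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2], 'B': [6, 4, 5]})

# One long line: valid, but hard to read and impossible to annotate step by step
one_line = df.assign(C=lambda x: x['A'] + x['B']).sort_values(by='C').head(2)

# Backslash continuation: valid, but nothing (not even a comment) may follow a backslash
with_backslashes = df.assign(C=lambda x: x['A'] + x['B']) \
    .sort_values(by='C') \
    .head(2)

# Parentheses: line breaks are free, and every step can carry its own comment
with_parens = (df
               .assign(C=lambda x: x['A'] + x['B'])  # add column 'C'
               .sort_values(by='C')                  # sort by 'C'
               .head(2)                              # keep the two smallest
              )

print(one_line.equals(with_backslashes) and one_line.equals(with_parens))  # True
```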

## History and Evolution of Chaining
Chaining as a concept is not new and has its roots in functional programming languages (Lisp, Scheme, Haskell), where it's common to pass the output of one function directly into another. Over time, this style has been adopted more widely, especially in the context of data manipulation and analysis.

## Benefits of Chaining
1. **Readability:** When done well, chaining can lead to code that is more straightforward and easier to understand at a glance.
2. **Maintainability:** Chaining often results in less verbose code, reducing the cognitive load required to maintain it.
3. **Expressiveness:** It allows programmers to express complex operations in a coherent and linear fashion.


# The `axis` Parameter in Pandas



The `axis` parameter in pandas offers flexibility in applying functions across different dimensions of a DataFrame.

- **`axis='rows'`**: applies the operation over the rows. It is an alias for `axis=0` and `axis='index'`.

We'll explore this parameter using the Penguins dataset, highlighting the use of `axis=0`, `axis=1`, `axis='rows'`, and `axis='columns'`. Additionally, we'll touch upon the alternative usage of `index=` and `columns=` parameters in some functions.

#### Loading the Dataset
For demonstration, we'll use the Penguins dataset, which can be loaded using seaborn:

```python
import seaborn as sns
df = sns.load_dataset('penguins')
```

#### Axis Parameter Examples

1. **Sum/Average Across Rows (`axis=0` or `axis='index'`):**
   This calculates the sum or average for each column, collapsing the rows.

   ```python
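   # Calculate the mean of each numeric column, collapsing across rows (axis=0 is the default)
   # numeric_only=True skips text columns such as 'species', which recent pandas versions refuse to average
   mean_values = df.mean(axis=0, numeric_only=True)
   # Or equivalently
   mean_values = df.mean(axis='index', numeric_only=True)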
   ```

2. **Sum/Average Across Columns (`axis=1` or `axis='columns'`):**
   This operation calculates the sum or average for each row, collapsing the columns.

   ```python
   # The dataset has numerical columns like 'bill_length_mm', 'bill_depth_mm';
   # numeric_only=True leaves out text columns such as 'species'
   df['mean_measurements'] = df.mean(axis=1, numeric_only=True)
   # Or equivalently
   df['mean_measurements'] = df.mean(axis='columns', numeric_only=True)
   ```

3. **Dropping Columns (`axis=1`):**
   To remove a column, set `axis=1`. This indicates the operation should occur column-wise.

   ```python
   # Drop a column
   df_dropped = df.drop('species', axis=1)
   ```

4. **Dropping Rows (`axis=0`):**
   To drop rows, use `axis=0`. With `dropna`, `axis=0` (the default) removes rows that contain missing values; with `drop`, it removes rows by label.

   ```python
   # Drop rows where 'species' is NaN
   df_dropped_rows = df.dropna(subset=['species'], axis=0)
   ```

5. **Applying a Function (`apply` Method):**
   The `apply` function in pandas can also use the `axis` parameter to apply a function across different axes.

   ```python
   # Apply a function to each row (axis=1 passes one row at a time to the lambda)
   df['total_measurements'] = df.apply(lambda row: row['bill_length_mm'] + row['bill_depth_mm'], axis=1)
   ```

6. **Alternative Usage (`index=` and `columns=`):**
   Some pandas functions offer `index=` and `columns=` as alternatives to `axis`. These parameters are more explicit and can enhance code readability.

   ```python
   # Using the `rename` method as an example
   df_renamed = df.rename(index={0: 'FirstRow'}, columns={'species': 'SpeciesType'})
   ```
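
`drop` is one method where the two spellings can be compared directly. The sketch below uses a small hand-built DataFrame as a hypothetical stand-in for a few Penguins rows, so it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few Penguins rows
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'bill_length_mm': [39.1, 46.1, 46.5],
    'bill_depth_mm': [18.7, 13.2, 17.9],
})

# Dropping a column: axis=1 and columns= give the same result
cols_by_axis = df.drop('species', axis=1)
cols_by_keyword = df.drop(columns='species')

# Dropping a row: axis=0 and index= give the same result
rows_by_axis = df.drop(0, axis=0)
rows_by_keyword = df.drop(index=0)

print(cols_by_axis.equals(cols_by_keyword))  # True
print(rows_by_axis.equals(rows_by_keyword))  # True
```

The keyword form states in the call itself whether a label refers to a row or a column, which is why many people find it easier to read than a bare `axis` number.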

#### Understanding `axis`:
- `axis=0` or `axis='index'`: The function runs vertically, down the DataFrame's rows, producing one result per column. It's the default in many methods.
- `axis=1` or `axis='columns'`: The function runs horizontally, across the DataFrame's columns, producing one result per row.
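
A quick way to check which direction an `axis` value collapses is to look at the index of the result. This is a minimal sketch on made-up measurements (the numbers are not from the Penguins dataset).

```python
import pandas as pd

# Made-up measurements, chosen so the means are easy to verify by hand
df = pd.DataFrame({
    'bill_length_mm': [39.0, 45.0, 48.0],
    'bill_depth_mm': [19.0, 13.0, 16.0],
})

col_means = df.mean(axis=0)  # collapses the rows: one value per column
row_means = df.mean(axis=1)  # collapses the columns: one value per row

print(col_means.index.tolist())  # column names
print(row_means.index.tolist())  # row labels
print(col_means.tolist())        # [44.0, 16.0]
print(row_means.tolist())        # [29.0, 29.0, 32.0]
```

The result of `axis=0` is labeled by column names, and the result of `axis=1` is labeled by row labels, which matches the "one result per column" and "one result per row" reading.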

### Conclusion
The `axis` parameter is a versatile tool in pandas, enabling succinct and powerful data manipulations across different dimensions of a DataFrame. By understanding and correctly applying `axis`, along with its alternative `index=` and `columns=` in some contexts, data analysis in pandas becomes more efficient and intuitive.