In [2]:
import pandas as pd
df = pd.read_csv("titanic.csv")

---
---

# Data Exploration
The following commands can be used to explore and understand the data before applying analysis to it.

---
---

## `read_csv()`

Pandas provides the `read_csv()` method to load data from a CSV file into a DataFrame.

`pd.read_csv(filepath)`

- **Purpose**: Reads a CSV file into a DataFrame.
- **Parameters**:
  - `filepath` (required): The path to the CSV file you want to load.
  - `sep` (optional): The delimiter to use. By default, it's a comma (`,`).
  - `header` (optional): Row number(s) to use as the column names. Default is 0 (the first row).
  - `index_col` (optional): Column(s) to set as the DataFrame's index.
  - `usecols` (optional): A list of column names to read. Useful for loading specific columns.
  - `dtype` (optional): Data type for the columns. Can be a dictionary to specify types per column.
  - `na_values` (optional): Additional strings to recognize as NA/NaN.
  
- **Usage Example**:
  ```python
  df = pd.read_csv('data.csv')  # Loads the CSV file 'data.csv' into a DataFrame
  df = pd.read_csv('data.csv', sep=';', header=0, index_col=0)  # Loads CSV with semicolon delimiter, using the first column as index


## `read_json()`

Pandas provides the `read_json()` method to load data from a JSON file or JSON string into a DataFrame.

`pd.read_json(filepath_or_buffer)`

- **Purpose**: Reads a JSON file or JSON string into a DataFrame.
- **Parameters**:
  - `filepath_or_buffer` (required): The path to the JSON file or the JSON string you want to load.
  - `orient` (optional): The expected format of the JSON string. Default is `auto`. Other options include `split`, `records`, `index`, `columns`, and `values`.
  - `typ` (optional): Type of object to return. Default is `'frame'` to return a DataFrame. Can also be `'series'` for a Series.
  - `dtype` (optional): Data type for the resulting DataFrame columns. Can be a dictionary to specify types per column.
  - `convert_dates` (optional): Whether to convert the date columns into datetime objects. Defaults to `True`.

- **Usage Example**:
  ```python
  df = pd.read_json('data.json')  # Loads the JSON file 'data.json' into a DataFrame
  df = pd.read_json('data.json', orient='records')  # Loads JSON using 'records' format


## `read_table()`

Pandas provides the `read_table()` method to load data from a general text file into a DataFrame.

`pd.read_table(filepath_or_buffer)`

- **Purpose**: Reads a general delimited text file into a DataFrame.
- **Parameters**:
  - `filepath_or_buffer` (required): The path to the text file or a buffer.
  - `sep` (optional): The delimiter used to separate fields in the file. Default is tab (`\t`), making it ideal for tab-separated files.
  - `header` (optional): Row number(s) to use as the column names. Default is 0 (the first row).
  - `index_col` (optional): Column(s) to set as the DataFrame's index.
  - `dtype` (optional): Data type for the resulting DataFrame columns. Can be a dictionary to specify types per column.
  - `na_values` (optional): Additional strings to recognize as NA/NaN.

- **Usage Example**:
  ```python
  df = pd.read_table('data.txt')  # Loads a tab-separated file 'data.txt' into a DataFrame
  df = pd.read_table('data.txt', sep=',', index_col=0)  # Loads a file with comma delimiter and uses the first column as index


---
---

# Data Exploration
The following commands can be used to explore and understand the data before applying analysis to it.

---
---

## `.head()` and `.tail()`

Pandas provides two useful methods for examining the top and bottom rows of a DataFrame: `head()` and `tail()`.

`df.head(n)`

- **Purpose**: Returns the first/last `n` rows of the DataFrame.
- **Parameters**:
  - `n` (optional): The number of rows you want to observe from the top of the dataset. If not specified, it defaults to 5.
- **Usage Example**:
  ```python 
  df.head(10)  # Returns the first 10 rows of the DataFrame
  ```

In [3]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


## `.sample()`

Pandas provides the `sample()` method to randomly select a sample of rows from a DataFrame.

`df.sample(n)`

- **Purpose**: Returns a random sample of `n` rows from the DataFrame. This is useful for quickly examining a subset of the data.
- **Parameters**:
  - `n` (optional): The number of rows to return. If not specified, it defaults to 1.
  - `frac` (optional): A fraction of rows to return. If specified, `n` should not be used. (frac = 0.1, return 10% of the rows)
  - `replace` (optional): Whether to allow sampling of the same row more than once (`True`) or not (`False`). The default is `False`.
  - `random_state` (optional): A seed for the random number generator for reproducibility. 
  - `axis` (optional): Axis to sample from. Defaults to 0 (rows). Set to 1 for sampling columns.
- **Usage Example**:
  ```python
  df.sample(5)  # Returns a random sample of 5 rows from the DataFrame
  df.sample(frac=0.1)  # Returns a random sample of 10% of the rows
  df.sample(n=3, random_state=42)  # Returns 3 random rows, with a fixed seed for reproducibility


In [4]:
df.sample(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
492,493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5,C30,S
843,844,0,3,"Lemberopolous, Mr. Peter L",male,34.5,0,0,2683,6.4375,,C
572,573,1,1,"Flynn, Mr. John Irwin (""Irving"")",male,36.0,0,0,PC 17474,26.3875,E25,S


In [5]:
df.sample(frac = 0.01)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
385,386,0,2,"Davies, Mr. Charles Henry",male,18.0,0,0,S.O.C. 14879,73.5,,S
719,720,0,3,"Johnson, Mr. Malkolm Joackim",male,33.0,0,0,347062,7.775,,S
268,269,1,1,"Graham, Mrs. William Thompson (Edith Junkins)",female,58.0,0,1,PC 17582,153.4625,C125,S
476,477,0,2,"Renouf, Mr. Peter Henry",male,34.0,1,0,31027,21.0,,S
639,640,0,3,"Thorneycroft, Mr. Percival",male,,1,0,376564,16.1,,S
391,392,1,3,"Jansson, Mr. Carl Olof",male,21.0,0,0,350034,7.7958,,S
787,788,0,3,"Rice, Master. George Hugh",male,8.0,4,1,382652,29.125,,Q
404,405,0,3,"Oreskovic, Miss. Marija",female,20.0,0,0,315096,8.6625,,S
552,553,0,3,"O'Brien, Mr. Timothy",male,,0,0,330979,7.8292,,Q


## `.shape`

Pandas provides the `shape` attribute to quickly check the dimensionality of a DataFrame.

`df.shape`

- **Purpose**: Returns a tuple representing the dimensions of the DataFrame, (no. rows, no. columns).
- **Parameters**: 
  - None. `shape` is an attribute, not a method, so it doesn't take any arguments.
- **Usage Example**:
  ```python
  df.shape  # Returns a tuple (number_of_rows, number_of_columns)


In [6]:
df.shape # 891 rows, 12 columns

(891, 12)

## `.columns`

Pandas provides the `columns` attribute to get the column labels of a DataFrame.

`df.columns`

- **Purpose**: Returns an index object containing the column labels of the DataFrame. 
- **Parameters**: 
  - None. `columns` is an attribute, not a method, so it doesn't take any arguments.
- **Usage Example**:
  ```python
  df.columns  # Returns an index of the column labels


In [7]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## `.info()`

Pandas provides the `info()` method to quickly get a summary of the DataFrame.

`df.info()`

- **Purpose**: Provides a concise summary of the DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.
- **Parameters**:
  - `verbose` (optional): Whether to print the full summary (`True`) or a truncated one (`False`). Defaults to `None`, which automatically decides based on the number of columns.
  - `max_cols` (optional): Specifies the maximum number of columns to display. Defaults to `None`.
- **Usage Example**:
  ```python
  df.info()  # Prints a concise summary of the DataFrame


In [8]:
df.info() # Embarked, Age, Cabin has some null values (as their "non-null" values less than 891)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## `.describe()`

Pandas provides the `describe()` method to generate descriptive statistics for the DataFrame. Used <u>specifically for numerical attributes with no null values</u>.

`df.describe()`

- **Purpose**: Generates a summary of statistics for numerical columns in the DataFrame, including count, mean, standard deviation, min, max, and the quartile values (25%, 50%, and 75%). 
- **Parameters**:
  - `percentiles` (optional): A list of percentiles to include in the output. Defaults to `[0.25, 0.5, 0.75]`.
  - `include` (optional): Specifies the data types to include in the summary. Can be `None` (default), `all`, or a list of data types.
  - `exclude` (optional): Specifies the data types to exclude from the summary.
  - `datetime_is_numeric` (optional): Whether to treat datetime data as numeric when calculating statistics. Defaults to `False`.
- **Usage Example**:
  ```python
  df.describe()  # Returns descriptive statistics for numerical columns WITH NO NULL VALUES 
  df.describe(include='all')  # Returns descriptive statistics for all columns


In [9]:
df.describe() # Numerical columns with no null values

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [10]:
df.describe(include = 'all') # All numerical columns (even with nulls)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


## `.unique()`

Pandas provides the `unique()` method to find the unique values in a Series or DataFrame column.

`df['column'].unique()`

- **Purpose**: Returns an array of the unique values in a Series or DataFrame column.
- **Parameters**:
  - None.
- **Usage Example**:
  ```python
  df['column'].unique()  # Returns an array of unique values in the specified column


In [11]:
# Two ways to access columns. Both achieve the same thing (second one is cleaner for single columns)
df['Embarked'].unique()
df.Embarked.unique()


array(['S', 'C', 'Q', nan], dtype=object)

## `.value_counts()`

Pandas provides the `value_counts()` method to count the number of occurrences of each unique value in a Series or DataFrame column.

`df['column'].value_counts()`

- **Purpose**: Returns a Series containing counts of unique values in descending order. 
- **Parameters**:
  - `dropna` (optional): If `True` (default), missing values (`NaN`) are excluded. If `False`, missing values are included in the counts.
  - `normalize` (optional): If `True`, the method returns the relative frequencies of the unique values instead of the raw counts (proportion/%). Defaults to `False`.
  - `sort` (optional): If `True` (default), the counts are sorted in descending order. If `False`, the counts are returned in the order they appear.
  - `ascending` (optional): If `True`, the counts are sorted in ascending order. Defaults to `False`.

- **Usage Example**:
  ```python
  df['column'].value_counts()  # Returns a Series with counts of unique values in descending order
  df['column'].value_counts(normalize=True)  # Returns the relative frequencies of unique values


In [12]:
df.Pclass.value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [13]:
df.Pclass.value_counts(normalize=True)

Pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64

In [14]:
# Return the number of times the number $26 appears in "Fare"
fare_count = df.Fare.value_counts()

# Get the specific value of interest
specific_value = 26
fare_count.get(specific_value, 0)

31

## `.get()`

Pandas provides the `get()` method to retrieve values from a Series or DataFrame based on labels or keys.

### For DataFrames

**Purpose**: While DataFrames do not have a direct `get()` method for row selection, boolean indexing can be used to achieve similar results by filtering rows based on a condition.

**Parameters**:
  - `condition`: A boolean condition or filter that specifies which rows to select.

**Usage Example (Data Frames)**:

<u>Remember for dataframes</u>, we don't need to use `.get()`, we can just do indexing. See below: 
```python
df.loc[df['Name'] == 'BoB']  # Retrieves rows where the 'Name' column has the value 'BoB'
```

**Usage Example (Series)**:

<u>For series</u>, we can use `.get()`.
```python
series_name.get(key_value, value_if_not_found)
```


In [15]:
# Create a DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 45]
})

# Find the row where the ID is 3
df1[df1['ID'] == 3]


Unnamed: 0,ID,Name,Age
2,3,Charlie,35


In [16]:
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)

# Get the value of "b"
s.get("b", "Value not found")


a    10
b    20
c    30
dtype: int64


20

## `.isin()`

Pandas provides the `isin()` method to check if each element in a DataFrame or Series is contained in a specified set of values.

`df['column'].isin(values)`

- **Purpose**: Returns a boolean Series indicating whether each element in the Series or DataFrame column is contained in the given list, set, or other iterable of values.
- **Parameters**:
  - `values`: A <u>list, set, or array-like object</u> containing the values to check against. It can also be a Series or DataFrame.
- **Usage Example**:
  ```python
  df['column'].isin([value1, value2, value3])  # Returns a boolean Series indicating membership


In [17]:
df.Sex.isin(['male'])

0       True
1      False
2      False
3      False
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Sex, Length: 891, dtype: bool

---
---

# Missing Data
The following commands can used to identify and deal with missing data.

---
---

## `.isna()`

Pandas provides the `isna()` method to detect missing values (NaN) in a DataFrame or Series.

`df.isna()`

- **Purpose**: Returns a DataFrame or Series of the same shape as the original, with boolean values indicating where each element is missing (NaN). 
- **Parameters**:
  - None. `isna()` is a method that does not take any arguments.
- **Usage Example**:
  ```python
  df.isna()  # Returns a DataFrame or Series of boolean values indicating missing data


In [18]:
example = pd.DataFrame({
    "Cabin": df.Cabin,
    "isna()": df.Cabin.isna()
    })

example

Unnamed: 0,Cabin,isna()
0,,True
1,C85,False
2,,True
3,C123,False
4,,True
...,...,...
886,,True
887,B42,False
888,,True
889,C148,False


## `.notna()`

Pandas provides the `notna()` method to detect non-missing values (not NaN) in a DataFrame or Series.

`df.notna()`

- **Purpose**: Returns a DataFrame or Series of the same shape as the original, with `True` for non-missing values and `False` for missing values (NaN). This is useful for identifying where data is present in your dataset.
- **Parameters**: 
  - There are no specific parameters for `notna()`, as it operates directly on the DataFrame or Series.
  
- **Usage Example**:
  ```python
  df.notna()  # Returns a DataFrame with True for non-NaN values and False for NaN values

In [19]:
example1 = pd.DataFrame({
    "Cabin": df.Cabin,
    "isna()": df.Cabin.notna()
    })

example1

Unnamed: 0,Cabin,isna()
0,,False
1,C85,True
2,,False
3,C123,True
4,,False
...,...,...
886,,False
887,B42,True
888,,False
889,C148,True


## `.dropna()`

Pandas provides the `dropna()` method to remove missing values (NaN) from a DataFrame or Series.

`df.dropna()`

- **Purpose**: Removes rows or columns with missing values from the DataFrame or Series. By default drops any rows with NA values.
- **Parameters**:
  - `axis` (optional): Specifies whether to drop rows (`axis=0`, default) or columns (`axis=1`).
  - `how` (optional): Determines which rows or columns to drop based on missing values. 
    - `'any'`: Drop if any missing values are present.
    - `'all'`: Drop if all values are missing.
  - `thresh` (optional): An integer value specifying the minimum number of non-NA values required to keep the row or column.
  - `subset` (optional): Specifies a subset of columns or rows to consider for missing values.
  - `inplace` (optional): If `True`, performs the operation in-place without returning a new DataFrame. Defaults to `False`.
- **Usage Example**:
  ```python
  df.dropna()  # Removes rows with any missing values
  df.dropna(axis=1)  # Removes columns with any missing values
  df.dropna(how='all')  # Removes rows where all values are missing
  df.dropna(thresh=2)  # Removes rows with fewer than 2 non-null values
  df.dropna(subset=['A', 'B'])  # Removes rows where any of the specified columns have missing values
  df.dropna(inplace=True)  # Removes rows with any missing values and modifies the original DataFrame


In [20]:
df.dropna(subset = ["Age"], inplace = True) # Remove rows where "Age" has NA values  
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## `.fillna()`

Pandas provides the `fillna()` method to fill missing values (NaN) in a DataFrame or Series with specified values.

`df.fillna(value=None)`

- **Purpose**: Replaces missing values with a specified value, method, or forward/backward fill. This is useful for handling missing data by imputing it with specific values or by propagating existing values.
- **Parameters**:
  - `value` (optional): The value(s) to use for filling missing values. It can be a scalar, dictionary, Series, or DataFrame.
  - `method` (optional): The method to use for filling missing values:
    - `'ffill'` or `'pad'`: Forward fill. Propagates the last valid value forward.
    - `'bfill'` or `'backfill'`: Backward fill. Propagates the next valid value backward.
  - `axis` (optional): Specifies the axis to fill (0 for rows, 1 for columns). Default is `None`.
  - `inplace` (optional): If `True`, modifies the original DataFrame or Series in place. Defaults to `False`.
  - `limit` (optional): The maximum number of missing values to fill. Default is `None`, which means no limit.
- **Usage Example**:
  ```python
  df.fillna(value=0)  # Replaces all missing values with 0
  df.fillna(value={'A': 0, 'B': 1})  # Fills missing values with 0 in column 'A' and 1 in column 'B'


In [21]:
# Fills missing value with "No Cabin Identified" for column "Cabin" and...
df.fillna(value = {"Cabin": "No Cabin Identified", "Embarked": "Not Recorded"}, inplace = True) 
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


---
---

# Selecting Data
The following commands can used select data.

---
---

## `.iloc[]`

Pandas provides the `iloc[]` method for integer-location based indexing for selection by position in a DataFrame or Series.

`df.iloc[rows, columns]`

- **Purpose**: Allows for selection and filtering of rows and columns based on their integer position. 
- **Parameters**:
  - `rows`: The integer index or slice for selecting rows.
  - `columns`: The integer index or slice for selecting columns. Can be omitted if selecting only rows.
- **Usage Example**:
  ```python
  df.iloc[0]  # Returns the first row as a SERIES 
  df.iloc[[0]]  # Returns the first row as a DATA FRAME
  df.iloc[ : , [1]]  # Returns the second column as a Series
  df.iloc[0:3, 1:3]  # Returns a subset of the DataFrame from rows 0 to 2 and columns 1 to 2 (upper bound is not inclusive)
  df.iloc[[0, 2], [1, 2]]  # Returns the rows at index 0 and 2 and columns at index 1 and 2


In [22]:
# Return the first 10 rows, showing columns 0 (PassengerId), 3 (Name) and 6 (SibSp)
df.iloc[0:10, [0,3,6]] 

Unnamed: 0,PassengerId,Name,SibSp
0,1,"Braund, Mr. Owen Harris",1
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,3,"Heikkinen, Miss. Laina",0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,5,"Allen, Mr. William Henry",0
6,7,"McCarthy, Mr. Timothy J",0
7,8,"Palsson, Master. Gosta Leonard",3
8,9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",0
9,10,"Nasser, Mrs. Nicholas (Adele Achem)",1
10,11,"Sandstrom, Miss. Marguerite Rut",1


## `.loc[]`

Pandas provides the `loc[]` method for label-based indexing to select rows and columns by their labels.

`df.loc[rows, columns]`

- **Purpose**: Allows for selection and filtering of rows and columns based on their labels. 
- **Parameters**:
  - `rows`: The label(s) for selecting rows. Can be a single label, a list of labels, or a slice.
  - `columns`: The label(s) for selecting columns. Can be a single label, a list of labels, or a slice. Can be omitted if selecting only rows.
- **Usage Example**:
  ```python
  df.loc['row_label']  # Returns a SERIES with the data for the specified row label (will be the index by default, less "set_index" is used)
  df.loc[['row_label']] # Returns a DATA FRAME with the data for the specified row label (will be the index by default, less "set_index" is used)
  df.loc[ : , ['column_label']]  # Returns a Series with the data for the specified column label
  df.loc['row1':'row3', 'col1':'col3']  # Returns a subset of the DataFrame from rows 'row1' to 'row3' and columns 'col1' to 'col3'
  df.loc[['row1', 'row3'], ['col2', 'col3']]  # Returns the data for specified rows and columns


In [23]:
# Returns the first 4 passengers with columns PassengerId, Name and Age (in that order)
df.loc[0:3, ["PassengerId", "Name", "Age"]]

Unnamed: 0,PassengerId,Name,Age
0,1,"Braund, Mr. Owen Harris",22.0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,3,"Heikkinen, Miss. Laina",26.0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0


## `.iat[]` (faster iloc for single values)

Pandas provides the `iat[]` method for integer-based scalar access to select a single value from a DataFrame by its row and column positions.

`df.iat[row_index, column_index]`

- **Purpose**: Allows for fast access to a single value in a DataFrame using the integer indices of the row and column. It is more efficient than `.iloc[]` for retrieving single values.
- **Parameters**:
  - `row_index`: The integer position of the row you want to access (0-based index).
  - `column_index`: The integer position of the column you want to access (0-based index).
  
- **Usage Example**:
  ```python
  value = df.iat[1, 3]  # Returns the value at the second row and fourth column


In [52]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,Third Class,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,No Cabin Identified,S,Young
1,2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,Third Class,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,No Cabin Identified,S,Middle-Aged
3,4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Middle-Aged
4,5,0,Third Class,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,No Cabin Identified,S,Middle-Aged


In [55]:
# Get the "Name" of the second passenger (0-indexed)
df.iat[1,3]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

## `.at[]` (faster loc for single values)

Pandas provides the `at[]` method for label-based scalar access to select a single value from a DataFrame by its row and column labels.

`df.at[row_label, column_label]`

- **Purpose**: Allows for fast access to a single value in a DataFrame using the labels of the row and column. It is more efficient than `.loc[]` for retrieving single values.
- **Parameters**:
  - `row_label`: The label of the row you want to access.
  - `column_label`: The label of the column you want to access.
  
- **Usage Example**:
  ```python
  value = df.at['row_label', 'column_label']  # Returns the value at the specified row and column labels


In [57]:
# Get the "Name" of the second passenger (0-indexed)
df.at[2,"Name"]

'Heikkinen, Miss. Laina'

---
---

# Data Transformation
The following commands can used transform data such as replacing values, adding in new values or removing values.

---
---

## `.apply()`

Pandas provides the `apply()` method to apply a function along an axis of the DataFrame or Series. This is useful for performing operations or transformations on data.

`df.apply(func, axis=0)`

- **Purpose**: Applies a function along the specified axis of the DataFrame or Series. This can be used to perform complex calculations or transformations on your data.
- **Parameters**:
  - `func`: The function to apply. This can be a user-defined function, a lambda function, or a built-in function.
  - `axis` (optional): Specifies whether to apply the function to rows (`axis=0`, default) or columns (`axis=1`).
  - `result_type` (optional): Determines the format of the result (only applicable when `axis=1`). Options are `expand`, `reduce`, or `broadcast`.
- **Usage Example**:
  ```python
  df.apply(lambda x: x + 1)  # Adds 1 to every element in the DataFrame
  df.apply(lambda x: x.mean(), axis=0)  # Computes the mean of each column
  df.apply(lambda x: x.max() - x.min(), axis=1)  # Computes the range (max - min) for each row



In [24]:
# Add 1 to every value in the column "Age" and "Fare" 
df[['Age', 'Fare']].apply(lambda x: x+1)

Unnamed: 0,Age,Fare
0,23.0,8.2500
1,39.0,72.2833
2,27.0,8.9250
3,36.0,54.1000
4,36.0,9.0500
...,...,...
885,40.0,30.1250
886,28.0,14.0000
887,20.0,31.0000
889,27.0,31.0000


In the example below, we create a new row that classifies whether a person is Old or Young.

In [25]:
def replacement(row):
    if row < 25: 
        return "Young"
    elif 25 <= row < 40:
        return "Middle-Aged"
    else:
        return "Young"
    
df['Age Classification'] = df["Age"].apply(replacement)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Young
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Middle-Aged


## `replace()`

Pandas provides the `replace()` method to replace specified values with other values within a DataFrame or Series. values.

`df.replace(to_replace, value=None, inplace=False, limit=None, regex=False)`

- **Purpose**: Replaces occurrences of `to_replace` with `value`. This can be used to clean or transform data by substituting specific values.
- **Parameters**:
  - `to_replace`: The value or pattern to be replaced. This can be a scalar, list, dictionary, or regex pattern.
  - `value` (optional): The value to replace `to_replace` with. Can be a scalar, list, or dictionary. If not specified, `to_replace` will be replaced with `None`.
  - `inplace` (optional): If `True`, performs the operation in place and modifies the DataFrame/Series directly. Defaults to `False`.
  - `limit` (optional): The maximum number of replacements to make. If not specified, all occurrences are replaced.
  - `regex` (optional): If `True`, treats `to_replace` as a regex pattern. Defaults to `False`.
- **Usage Example**:
  ```python
  df.replace(2, 10)  # Replaces all occurrences of 2 with 10 in the DataFrame
  df["column_name"].replace(replacement_dictionary)  # Replace values in column "column_name" based on a dictionary 

  df["Age"].replace(50, "Mid-Life") # Replace 50 with "Mid-Life" in the "Age" column only 
  df.replace({'Age': 50}, {'Age': "Mid-Life"})  # Replace 50 with "Mid-Life" in the "Age" column only 


In [26]:
# Define a replacement dictionary 
replacement_dictionary = {
    1: "First Class",
    2: "Second Class",
    3: "Third Class"
}

# Replace the 1, 2, 3 in "Pclass" with First Class, Second Class and Third Class
df.Pclass = df.Pclass.replace(replacement_dictionary)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,Third Class,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
1,2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,Third Class,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
3,4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
4,5,0,Third Class,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,Third Class,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
886,887,0,Second Class,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
887,888,1,First Class,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Young
889,890,1,First Class,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Middle-Aged


## `melt()`

Pandas provides the `melt()` method to unpivot or transform a DataFrame from a wide format to a long format.

`df.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)`

- **Purpose**: Converts columns of a DataFrame into rows. This is helpful for normalizing data or for preparing data for certain types of analyses or visualizations.
- **Parameters**:
  - `id_vars` (optional): Columns to set as identifier variables. These columns will remain unchanged in the output.
  - `value_vars` (optional): Columns to unpivot into rows. If not specified, all columns not included in `id_vars` will be used.
  - `var_name` (optional): Name to use for the new column that will contain the former column names. Defaults to `'variable'`.
  - `value_name` (optional): Name to use for the new column that will contain the values from the original columns. Defaults to `'value'`.
  - `col_level` (optional): If columns have multiple levels (MultiIndex), specifies which level to use for unpivoting.



**What is WIDE vs LONG formatted data?**
- <u> Wide data (more common) </U>: each individual individual occupies their own row, and each of their variables occupy a single column.
  - Considered "people friendly" as it is easy to read and interpret (all information about an indivdual is available at a single glance)
  - An easy way to identify wide data, is that the first column tends not to repeat 
- <u> Long data (desired) </U>: long data allows for multiple rows for each entity, and instead records new attributes or observations as a new row in the dataset.
  - Considered "machine friendly" as allows you to perform grouping and aggregation on it 
  - Adding new data (such as another Category of Sport) is easier as you can add a new row (Sport, Value) instead of having to create a new column and potentially have NA for previous row 
  

  


In [27]:
# Create the WIDE DataFrame
df2 = pd.DataFrame({
    'Date': ['2024-08-01', '2024-08-02', '2024-08-03'],
    'Product_A': [10, 20, 30],
    'Product_B': [15, 25, 35]
})
df2


Unnamed: 0,Date,Product_A,Product_B
0,2024-08-01,10,15
1,2024-08-02,20,25
2,2024-08-03,30,35


Above is the original WIDE data frame. Below is the melted LONG data frame.

In [28]:
# Melt the DataFrame
df_melted = df2.melt(id_vars='Date', value_vars=['Product_A', 'Product_B'], var_name='Product', value_name='Sales')
df_melted

Unnamed: 0,Date,Product,Sales
0,2024-08-01,Product_A,10
1,2024-08-02,Product_A,20
2,2024-08-03,Product_A,30
3,2024-08-01,Product_B,15
4,2024-08-02,Product_B,25
5,2024-08-03,Product_B,35


---
---

# Data Frame Manipulation
The following commands can used to modify the structure of the data frame.

---
---

## `pd.DataFrame()`

Pandas provides the `pd.DataFrame()` constructor to create a DataFrame object from various data sources. 

`pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)`

- **Purpose**: Creates a DataFrame object from different types of input data, such as dictionaries, lists, or arrays.
- **Parameters**:
  - `data` (optional): Data to populate the DataFrame. This can be a dictionary, list, array, or other data structures. 
  - `index` (optional): Sequence to use as the row labels of the DataFrame. Defaults to `None`, which means an integer index is used.
  - `columns` (optional): Sequence to use as the column labels of the DataFrame. Defaults to `None`, which means columns are auto-generated.
  - `dtype` (optional): Data type to force the DataFrame to. If not specified, the data types are inferred from the data.
  - `copy` (optional): Whether to copy the data. Defaults to `False`. If `True`, a copy of the data is made.

- **Usage Example**:
  ```python
  import pandas as pd

  # Create a DataFrame from a dictionary
  df_dict = pd.DataFrame({
      'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'City': ['New York', 'Los Angeles', 'Chicago']
  })

  # Create a DataFrame from a list of lists
  df_list = pd.DataFrame([
      [1, 'Alice', 24],
      [2, 'Bob', 30],
      [3, 'Charlie', 35]
  ], columns=['ID', 'Name', 'Age'])

  # Create a DataFrame with custom index and column names
  df_custom = pd.DataFrame(
      data=[[10, 20, 30], [40, 50, 60]],
      index=['Row1', 'Row2'],
      columns=['Column1', 'Column2', 'Column3']
  )


## `Adding a Column`

Pandas provides several methods for adding a new column to a DataFrame. This is useful for augmenting the DataFrame with additional data or calculated values.
- `Direct Assignment`: useful when adding few amount of columns. Modifies it in place.
- `.assign()`: best for adding multiple columns 
- `.insert()`: best for adding columns in a specific position

For the examples below, lets first create a dataframe to work with: 
```python 
# Create the DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
```


### Method 1: Direct Assignment

You can directly assign a new column to a DataFrame by specifying the column name and its values.

```python
# Add a new column with a constant value (will create a new column "Country" with value "USA" for all)
df['Country'] = 'USA'

# Add a new column with different values
df['Score'] = [85, 90, 95]

# OUTPUT 
       Name  Age Country  Score
0     Alice   25     USA     85
1       Bob   30     USA     90
2   Charlie   35     USA     95
```

### Method 2: Using `.assign()`

You can use the assign() method to add one or more new columns to the DataFrame.

```python
# Add a new column using assign
df = df.assign(Country='USA', Score=[85, 90, 95])

# OUTPUT 
       Name  Age Country  Score
0     Alice   25     USA     85
1       Bob   30     USA     90
2   Charlie   35     USA     95
```

### Method 3: Using `.insert()`

You can use the insert() method to add a new column at a specific position in the DataFrame.

```python 
# Add a new column at the second position
df.insert(1, 'Score', [85, 90, 95])


# OUTPUT (notice we inserted it between the two existing columns)
       Name  Score  Age
0     Alice     85   25
1       Bob     90   30
2   Charlie     95   35
```

## `Adding a Row`

Pandas provides several methods for adding a new row to a DataFrame. 
- `loc[]`: if you specify a location within the no. rows, it will override it. So you must always use len(df) at that point .append() is better.
- `.append()`: used most often for adding simple observations to a data frame
- `.concat()`: used for more complex appending like appending dataframes together (but then you would use merge)

For the examples below, let’s first create a DataFrame to work with:

```python
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
```

### Method 1: Using `.loc[]`/`iloc[]`

You can use the .loc indexer to add a new row by specifying a new index label.

```python
# Add a new row using loc. When only one value is passed in .loc[], its a row location
df.loc[len(df)] = ['David', 40]

# OUTPUT 
       Name  Age
0     Alice   25
1       Bob   30
2   Charlie   35
3     David   40
```

### Method 2: Using `.append()`

You can use the .append() method to add a new row to the DataFrame. Note that append() returns a new DataFrame with the added row.

```python
# Create new row to append
new_row = pd.DataFrame({'Name': ['David'], 'Age': [40]})

# Append the new row 
df = df.append(new_row, ignore_index=True) # "ignore_index = True" will reset the index

# OUTPUT 
       Name  Age
0     Alice   25
1       Bob   30
2   Charlie   35
3     David   40
```

### Method 3: Using `.concat()`

You can use the .concat() function to concatenate the existing DataFrame with a new DataFrame containing the row to be added.

```python 
new_row = pd.DataFrame({'Name': ['David'], 'Age': [40]})

# Add the new row using concat
df = pd.concat([df, new_row], ignore_index=True)

# OUTPUT
       Name  Age
0     Alice   25
1       Bob   30
2   Charlie   35
3     David   40
```

## `Removing a Column`

Pandas provides several methods for removing columns in a DataFrame. 
- `.drop()`: remvoes a specific column. Very similar to .del however you can drop <u>MULTIPLE columns</u>.
- `.pop()`: pops the dataframe --> deletes it and returns it in a variable as a series.
- `.del`: probably the best, just deletes it. Can only remove <u>ONE column at a time</u>.

For the examples below, let’s first create a DataFrame to work with:

```python
# Create the DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Country': ['USA', 'UK', 'Canada']
})
```

### Method 1: Using .drop()

You can use the .drop() method to remove columns by specifying the column names and setting axis=1.

```python
# Remove a single column
df = df.drop('Country', axis=1)

# Below would drop "Country" and "State"
df = df.drop(['Country', 'State'], axis=1)

# OUTPUT 
       Name  Age
0     Alice   25
1       Bob   30
2   Charlie   35
```

### Method 2: Using `.pop()`

The .pop() method removes a column from a DataFrame and returns it as a Series. This method also modifies the DataFrame in place.

```python
# Remove and return a column
country_col = df.pop('Country') # Can now access the removed column as a series through "country_col"

# OUTPUT 
       Name  Age
0     Alice   25
1       Bob   30
2   Charlie   35
```

### Method 3: Using `.del`

You can also use the del statement to remove a column from the DataFrame. This approach directly deletes the column.

```python 
del df['Country']

# OUTPUT
       Name  Age
0     Alice   25
1       Bob   30
2   Charlie   35
```

## `Removing a Row`

Pandas provides several methods for removing a new row in a DataFrame. 
- `.drop()`: when you need to remove a specific observation
- `conditional filtering`: when you want to retain specific rows depending on a condition
- `indexing with .iloc[]`: when you want to retain a specific range of rows and columns


For the examples below, let’s first create a DataFrame to work with:

```python
# Create the DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
```

### Method 1: Using `.drop()`

When you want to remove rows by specifying their index labels. This method is flexible and allows you to remove multiple rows at once.

```python
# Remove row with index 1
df_dropped = df.drop(1)

# Remove row with index 1 and 2
df = df.drop([1,2])

# OUTPUT 
       Name  Age
0     Alice   25
```

### Method 2: `Conditional Filtering`

When you need to remove rows based on a condition that affects multiple rows. Use boolean indexing to filter out rows that do not meet the condition.

```python
df = df[df['Age'] <= 30]

# OUTPUT 
       Name  Age
0     Alice   25
1       Bob   30
```

### Method 3: Using `.iloc[]`

When you want to remove a range of rows based on their integer position.

```python 
df = df.drop(df.index[0:2])  # Drops rows at index 1 and 2

# OUTPUT
       Name  Age
1       Bob   30
```

## `Reordering Columns`
There is one main way of reordering which is just re-indexing the columns. 

```python
# Create the DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Country': ['USA', 'UK', 'Canada']
})

# Reorder columns
df = df[['Name', 'Country', 'Age']]  # Specify new order of columns

# OUTPUT
       Name Country  Age
0     Alice     USA   25
1       Bob      UK   30
2   Charlie  Canada   35
```

## `Renaming Columns`
There is one two way of reordering which is just re-indexing the columns. 
- `Direct Assignment`: useful for just <u>OVERRIDING column names</u> with new ones 
- `.rename()`: useful when you need to <u>REPLACE column names</u> 

```python
# Create the DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Country': ['USA', 'UK', 'Canada']
})
```
### Method 1: `Direct Assignment`
```python
df.columns = ['Full Name', 'Age (Years)', 'Country']

# OUTPUT
  Full Name  Age (Years) Country
0     Alice           25     USA
1       Bob           30      UK
2   Charlie           35  Canada
```

### Method 2: `.rename()`

```python
# Rename columns using rename
df = df.rename(columns={'Name': 'Full Name', 'Age': 'Age (Years)'})

# OUTPUT
  Full Name  Age (Years) Country
0     Alice           25     USA
1       Bob           30      UK
2   Charlie           35  Canada
```

## `set_index()`

Pandas provides the `set_index()` method to set one or more columns as the index of a DataFrame. 

`df.set_index(keys, inplace=True)`

- **Purpose**: Sets the specified column(s) as the index of the DataFrame. This helps in organizing data and enables efficient data retrieval and alignment.
- **Parameters**:
  - `keys`: The column(s) to be set as the new index. Can be a single column name or a list of column names.
  - `drop` (optional): If `True` (default), the column(s) used as the index will be removed from the DataFrame. If `False`, the columns will be retained.
  - `append` (optional): If `True`, adds the new index columns to the existing index. If `False` (default), replaces the existing index.
  - `inplace` (optional): If `True`, modifies the DataFrame in place without returning a new DataFrame. Defaults to `False`.

- **Usage Example**:
  ```python
  # Create the DataFrame
  df = pd.DataFrame({
      'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'Country': ['USA', 'UK', 'Canada']
  })

  # Set 'Name' column as the index
  df = df.set_index('Name')

             Age Country
  Name                   
  Alice       25     USA
  Bob         30      UK
  Charlie     35  Canada
  ```



In [29]:
# Set the passengerId as the new index 
df.set_index("PassengerId", inplace = True)
df

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,0,Third Class,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
3,1,Third Class,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
5,0,Third Class,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,Third Class,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
887,0,Second Class,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
888,1,First Class,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Young
890,1,First Class,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Middle-Aged


## `reset_index()`

Pandas provides the `reset_index()` method to reset the index of a DataFrame. 

`df.reset_index(inplace = True)`

- **Purpose**: Resets the index of the DataFrame, optionally converting the current index into columns. This is useful for flattening hierarchical indexes or removing an existing index.
- **Parameters**:
  - `level` (optional): Specifies which levels of a multi-index to reset. If not specified, all levels are reset.
  - `drop` (optional): If `True`, the index is reset but not added as a column in the DataFrame. Defaults to `False`.
  - `inplace` (optional): If `True`, modifies the DataFrame in place without returning a new DataFrame. Defaults to `False`.
  - `col_level` (optional): For a MultiIndex, specifies which level the labels are inserted into. Defaults to `0`.
  - `col_fill` (optional): For a MultiIndex, specifies the label name to use if a new column is added to a partially labeled DataFrame. Defaults to an empty string.

- **Usage Example**:
  ```python
  # Create the DataFrame with an index
  df = pd.DataFrame({
      'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'Country': ['USA', 'UK', 'Canada']
  }).set_index('Name')

  # Reset the index and convert it back to a column
  df_reset = df.reset_index()
  


In [30]:
df.reset_index(inplace = True) # If you just print out 
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,Third Class,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
1,2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,Third Class,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
3,4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
4,5,0,Third Class,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,886,0,Third Class,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
710,887,0,Second Class,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
711,888,1,First Class,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Young
712,890,1,First Class,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Middle-Aged


---
---

# Sorting and Ordering
The following commands can used to sort and order the data frames.

---
---

## `.sort_values()`

Pandas provides the `sort_values()` method to sort a DataFrame by the values of one or more columns.

`df.sort_values(by='column_name', ascending=True, inplace=True)`

- **Purpose**: Sorts the DataFrame based on the values in one or more specified columns. 

- **Parameters**:
  - `by` (required): Specifies the column or list of columns to sort by. If sorting by multiple columns, the order of the columns in the list determines the sort priority.
  - `ascending` (optional): Determines the sort order. If `True`, sorts in ascending order. If `False`, sorts in descending order. Can also be a list to specify the order for each column when sorting by multiple columns. Defaults to `True`.
  - `inplace` (optional): If `True`, performs the sort operation in place, modifying the existing DataFrame. If `False`, returns a new DataFrame with the sorted values. Defaults to `False`.
  - `na_position` (optional): Specifies the placement of NaNs in the sorted data. `'first'` places NaNs at the beginning, while `'last'` places them at the end. Defaults to `'last'`.
  - `ignore_index` (optional): If `True`, the resulting DataFrame will have a new integer index from 0 to `n-1`, ignoring the original index. Defaults to `False`.
  - `key` (optional): A function applied to each column before sorting. Useful for custom sorting, like sorting strings in a case-insensitive manner.

- **Usage Example**:
  ```python
  # Create the DataFrame
  df = pd.DataFrame({
      'Name': ['Alice', 'Bob', 'Charlie'],
      'Age': [25, 30, 35],
      'Country': ['USA', 'UK', 'Canada']
  })

  # Sort by Age in ascending order
  df_sorted = df.sort_values(by='Age')
  
  # Sort by Country in descending order (ALPHANUMERIC ORDERING)
  df_sorted_desc = df.sort_values(by='Country', ascending=False)
  
  # Sort by Age and then by Country
  df_sorted_multi = df.sort_values(by=['Age', 'Country'])


In [31]:
# Soting by name (alphanumeric)
df.sort_values(by = "Name")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
675,846,0,Third Class,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.5500,No Cabin Identified,S,Young
593,747,0,Third Class,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.2500,No Cabin Identified,S,Young
224,280,1,Third Class,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.2500,No Cabin Identified,S,Middle-Aged
245,309,0,Second Class,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0000,No Cabin Identified,C,Middle-Aged
699,875,1,Second Class,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,No Cabin Identified,C,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
444,560,1,Third Class,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",female,36.0,1,0,345572,17.4000,No Cabin Identified,S,Middle-Aged
230,287,1,Third Class,"de Mulder, Mr. Theodore",male,30.0,0,0,345774,9.5000,No Cabin Identified,S,Middle-Aged
227,283,0,Third Class,"de Pelsmaeker, Mr. Alfons",male,16.0,0,0,345778,9.5000,No Cabin Identified,S,Young
289,362,0,Second Class,"del Carlo, Mr. Sebastiano",male,29.0,1,0,SC/PARIS 2167,27.7208,No Cabin Identified,C,Middle-Aged


In [32]:
df2 = pd.DataFrame({
    "Name": ["Abby", "Jackson", "Bob", "Abby"],
    "Age": [14, 15, 51, 99]
})

# Sort values by NAME IN ASCENDING ORDER then, if there are duplicate names, sort by AGE IN DESCENDING ORDER 
df2.sort_values(by = ['Name', 'Age'], ascending = [True, False])

Unnamed: 0,Name,Age
3,Abby,99
0,Abby,14
2,Bob,51
1,Jackson,15


## `nlargest()` and `nsmallest()`

Pandas provides the `nlargest()` and `nsmallest()` methods to return the top <u>`n` largest or smallest rows</u> of a DataFrame based on the values in a specific column.

`df.nlargest(n=5, columns='column_name')`

`df.nsmallest(n=5, columns='column_name')`

- **Purpose**: 
  - `nlargest()`: Returns the `n` largest values in a specified column of the DataFrame.
  - `nsmallest()`: Returns the `n` smallest values in a specified column of the DataFrame. 

- **Parameters**:
  - `n` (optional): The number of top or bottom rows to return. Defaults to `5`.
  - `columns` (required): The column name or list of column names to sort by. The function will return rows based on the largest or smallest values in this column.
  - `keep` (optional): Determines which duplicates (if any) to keep. Options are `'first'` (keep the first occurrence), `'last'` (keep the last occurrence), and `False` (drop all duplicates). Defaults to `'first'`.

- **Usage Example**:
  ```python 
  # Create the DataFrame
  df = pd.DataFrame({
      'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
      'Score': [85, 92, 88, 95, 91]
  })

  # Get the top 3 scores
  top_scores = df.nlargest(n=3, columns='Score')

  # Get the bottom 3 scores
  bottom_scores = df.nsmallest(n=3, columns='Score')


In [33]:
# Return the 3 oldest members in the data frame 
df.nlargest(n = 3, columns = 'Age')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
498,631,1,First Class,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S,Young
679,852,0,Third Class,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,No Cabin Identified,S,Young
74,97,0,First Class,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,Young


In [34]:
df.nsmallest(n = 10, columns = 'Fare')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
144,180,0,Third Class,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,No Cabin Identified,S,Middle-Aged
212,264,0,First Class,"Harrison, Mr. William",male,40.0,0,0,112059,0.0,B94,S,Young
218,272,1,Third Class,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,No Cabin Identified,S,Middle-Aged
242,303,0,Third Class,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,No Cabin Identified,S,Young
472,598,0,Third Class,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,No Cabin Identified,S,Young
643,807,0,First Class,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,Middle-Aged
658,823,0,First Class,"Reuchlin, Jonkheer. John George",male,38.0,0,0,19972,0.0,No Cabin Identified,S,Middle-Aged
302,379,0,Third Class,"Betros, Mr. Tannous",male,20.0,0,0,2648,4.0125,No Cabin Identified,C,Young
697,873,0,First Class,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0,B51 B53 B55,S,Middle-Aged
262,327,0,Third Class,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,No Cabin Identified,S,Young


---
---

# Merging and Joining
The following commands can used to merge and join multiple data frames together.

---
---

## `.merge()`

Pandas provides the `merge()` method to combine two DataFrames based on one or more common columns or indices.

`df1.merge(df2, on='column_name', how='inner')`

- **Purpose**: Combines two DataFrames into a single DataFrame by aligning rows with matching values in specified columns or indices. 

- **Parameters**:
  - `df2` (required): The DataFrame to merge with the calling DataFrame. (known as "right" in the method but "df2" in the example above)
  - `on` (optional): Column or index level names to join on. Must be found in both DataFrames. If not specified, the method will attempt to join on columns with the same name.
    - `left_on` (optional): Use this when the join columns are not named the same. Columns from the left DataFrame to use as keys.
    - `right_on` (optional): Use this when the join columns are not named the same. Columns from the right DataFrame to use as keys.
  - `how` (optional): Type of merge to perform. Options are:
    - `'left'`: Use only keys from the left DataFrame.
    - `'right'`: Use only keys from the right DataFrame.
    - `'outer'`: Use keys from both DataFrames, filling in missing values with NaNs.
    - `'inner'`: Use only keys that are present in both DataFrames. Defaults to `'inner'`.
  - `left_index` (optional): If `True`, use the index from the left DataFrame as the join key.
  - `right_index` (optional): If `True`, use the index from the right DataFrame as the join key.
  - `suffixes` (optional): A tuple of string suffixes to apply to overlapping column names. Defaults to `('_x', '_y')`.

- **Usage Example**:
  ```python
  # Create the first DataFrame
  df1 = pd.DataFrame({
      'StudentID': [1, 2, 3],
      'Name': ['Alice', 'Bob', 'Charlie']
  })

  # Create the second DataFrame
  df2 = pd.DataFrame({
      'ID': [2, 3, 4],
      'Age': [25, 30, 35]
  })

  # Merge the DataFrames on the 'ID' column
  merged_df = df1.merge(df2, left_on = 'StudentID', right_on = 'ID', how='inner')


In [35]:
# Create the first DataFrame with different column names for the key
df1 = pd.DataFrame({
    'Employee_ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

# Create the second DataFrame with a different column name for the key
df2 = pd.DataFrame({
    'Staff_ID': [2, 3, 4],
    'Department': ['HR', 'Engineering', 'Finance']
})

# Merge the DataFrames using 'left_on' and 'right_on' with an outer join
merged_df = df1.merge(df2, left_on='Employee_ID', right_on='Staff_ID', how='outer')

merged_df

Unnamed: 0,Employee_ID,Name,Staff_ID,Department
0,1.0,Alice,,
1,2.0,Bob,2.0,HR
2,3.0,Charlie,3.0,Engineering
3,,,4.0,Finance


## `.concat()`

Pandas provides the `concat()` method to concatenate two or more DataFrames along a particular axis.
- Use when you want to stack dataframes vertically or horizontally without aligning on specific columns or values (which would be merging)

`pd.concat([df1, df2], axis=0, join='outer')`

- **Purpose**: Concatenates DataFrames along a specified axis, either row-wise (default) or column-wise. This is useful for combining datasets that share the same columns or indexes or for appending new data.


- **Parameters**:
  - `objs` (required): A sequence or mapping of DataFrames to concatenate.
  - `axis` (optional): The axis to concatenate along. `0` for row-wise (default) and `1` for column-wise.
  - `join` (optional): Determines how to handle indexes or columns that do not align. Options are:
    - `'inner'`: Use only the intersection of indexes or columns.
    - `'outer'`: Use the union of indexes or columns (default).
  - `ignore_index` (optional): If `True`, the resulting DataFrame will have a new integer index from 0 to `n-1`, ignoring the original indexes. Defaults to `False`.
  - `keys` (optional): If specified, creates a hierarchical index with these keys.
  - `levels` (optional): Specifies the levels for the hierarchical index.
  - `names` (optional): Names for the levels of the hierarchical index.

- **Usage Example**:
  ```python
  # Create the first DataFrame
  df1 = pd.DataFrame({
      'ID': [1, 2, 3],
      'Name': ['Alice', 'Bob', 'Charlie']
  })

  # Create the second DataFrame
  df2 = pd.DataFrame({
      'ID': [4, 5],
      'Name': ['David', 'Eve']
  })

  # Concatenate the DataFrames row-wise
  concatenated_df = pd.concat([df1, df2], axis=0)


In [36]:
""" Exammple of VERTICAL Concatentation"""

# Create the first DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

# Create the second DataFrame with the same columns
df2 = pd.DataFrame({
    'ID': [4, 5],
    'Name': ['David', 'Eve']
})

# Concatenate DataFrames vertically. 
vertical_concat = pd.concat([df1, df2], axis=0, ignore_index=True)

vertical_concat

Unnamed: 0,ID,Name
0,1,Alice
1,2,Bob
2,3,Charlie
3,4,David
4,5,Eve


If you remove "ignore_index = True", the indexes WILL NOT get reset and that is not desired.

In [37]:
pd.concat([df1, df2], axis=0, ignore_index=False)

Unnamed: 0,ID,Name
0,1,Alice
1,2,Bob
2,3,Charlie
0,4,David
1,5,Eve


In [38]:
# Create the first DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

# Create the second DataFrame with different columns
df2 = pd.DataFrame({
    'Age': [25, 30, 35],
    'Department': ['HR', 'Engineering', 'Finance']
})

# Concatenate DataFrames horizontally
horizontal_concat = pd.concat([df1, df2], axis=1)

horizontal_concat

Unnamed: 0,ID,Name,Age,Department
0,1,Alice,25,HR
1,2,Bob,30,Engineering
2,3,Charlie,35,Finance


---
---

# Grouping and Aggregation
The following commands can used to group by columns and find statistics about the groups such as mean, median, max or other functions.

---
---

## `.groupby()`

Pandas provides the `groupby()` method to group DataFrame rows based on the values in one or more columns. This is useful for performing aggregate operations on subsets of the data.

`df.groupby('column_name')`

- **Purpose**: Groups DataFrame rows based on specified column(s) and allows for aggregation or transformation operations on each group.

- **Parameters**:
  - `by` (required): The column(s) or index level(s) to group by.
  - `axis` (optional): The axis to group along. Defaults to `0` (rows).
  - `as_index` (optional): If `True` (default), the group labels are used as the index of the resulting DataFrame. If `False`, the group labels are returned as columns.

- **Aggregration Functions**: 
After aggregating using group by, we can apply aggregation functions to them. These include: 
- `.mean()`
- `.median()`
- `.min()`
- `.max()`
- `.std()`
- `.count()`
- `.sum()`
- `.apply()`: remember to pass in "include_groups = False"
- `.nunique()`: different to unique() as it provides a singular number of the number of unqiue values instead of a series.
- `.agg()`
  - `lambda x:`: use for simply functions like max() - min() or proportion
  - Aggregating with multiple functions: for example --> `.agg({"output_col_name": ['mean', 'sum', 'max']})`

### lambda x: Example 

In [39]:
# Create the DataFrame
df2 = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Sales': [200, 150, 300, 400, 500, 600]
})

# Group by 'Region' and apply a custom aggregation function using lambda
# Calculate the range (max - min) of sales for each region
df2.groupby('Region').agg(lambda x: x.max() - x.min())

Unnamed: 0_level_0,Sales
Region,Unnamed: 1_level_1
East,100
North,50
South,100


In [40]:
# Create the DataFrame
df2 = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 20, 30, 40, 50, 60]
})

# Define a lambda function to count the number of values greater than 25
df2.groupby('Category').agg(lambda x: x[x > 25].count()) 

Unnamed: 0_level_0,Value
Category,Unnamed: 1_level_1
A,0
B,2
C,2


In [41]:
# Create the DataFrame
df2 = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 20, 30, 30, 50, 50]
})

# Group by 'Category' and count unique values in 'Value'
df2.groupby('Category')['Value'].agg(lambda x: x.nunique())


Category
A    2
B    1
C    1
Name: Value, dtype: int64

### .apply() Example

In [42]:
# Create the DataFrame
df1 = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Value': [10, 20, 30, 40, 50, 60],
    'Weight': [1, 2, 1, 3, 2, 1]
})

# Define a custom weighted average function
def weighted_avg(row):
    return (row['Value'] * row['Weight']).sum() / row['Weight'].sum()

# Group by 'Category' and apply the custom weighted average function
df1.groupby('Category').apply(weighted_avg, include_groups = False) # "include_group = False" is required


TypeError: weighted_avg() got an unexpected keyword argument 'include_groups'

---
---

# String Manipulation
The following commands can used to merge and join multiple data frames together.

---
---

## `str.lower()`, `str.upper()`, and `str.title()`

Pandas provides the `str` accessor to perform vectorized string operations on Series or DataFrame columns of string types. Three common methods are `str.lower()`, `str.upper()`, and `str.title()` for changing the case of string data.

`str.lower()`
- **Purpose**: Converts all characters in each string of the Series or DataFrame column to lowercase. 


`str.upper()`
- **Purpose**: Converts all characters in each string of the Series or DataFrame column to uppercase. 

`str.title()`
- **Purpose**: Converts the first character of each word to uppercase and all other characters to lowercase in each string of the Series or DataFrame column. 

In [193]:
# Make "Sex" uppercase 
df.Sex = df.Sex.str.upper()
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,Third Class,"Braund, Mr. Owen Harris",MALE,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
1,2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,Third Class,"Heikkinen, Miss. Laina",FEMALE,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
3,4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",FEMALE,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
4,5,0,Third Class,"Allen, Mr. William Henry",MALE,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,886,0,Third Class,"Rice, Mrs. William (Margaret Norton)",FEMALE,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
710,887,0,Second Class,"Montvila, Rev. Juozas",MALE,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
711,888,1,First Class,"Graham, Miss. Margaret Edith",FEMALE,19.0,0,0,112053,30.0000,B42,S,Young
712,890,1,First Class,"Behr, Mr. Karl Howell",MALE,26.0,0,0,111369,30.0000,C148,C,Middle-Aged


In [194]:
# Make name "lower" case 
df.Name = df.Name.str.lower()
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,Third Class,"braund, mr. owen harris",MALE,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
1,2,1,First Class,"cumings, mrs. john bradley (florence briggs th...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,Third Class,"heikkinen, miss. laina",FEMALE,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
3,4,1,First Class,"futrelle, mrs. jacques heath (lily may peel)",FEMALE,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
4,5,0,Third Class,"allen, mr. william henry",MALE,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,886,0,Third Class,"rice, mrs. william (margaret norton)",FEMALE,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
710,887,0,Second Class,"montvila, rev. juozas",MALE,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
711,888,1,First Class,"graham, miss. margaret edith",FEMALE,19.0,0,0,112053,30.0000,B42,S,Young
712,890,1,First Class,"behr, mr. karl howell",MALE,26.0,0,0,111369,30.0000,C148,C,Middle-Aged


## `str.strip()`, `str.lstrip()`, and `str.rstrip()`

Pandas provides the `str` accessor to perform vectorized string operations on Series or DataFrame columns of string types. The methods `str.strip()`, `str.lstrip()`, and `str.rstrip()` are used to remove leading and trailing whitespace from strings, and they can also remove specified characters.

`str.strip()`

- **Purpose**: Removes leading and trailing whitespace (or specified characters) from each string in the Series or DataFrame column. 

- **Parameters**:
  - `chars` (optional): A string specifying the set of characters to be removed. If not provided, it defaults to removing whitespace.


`str.lstrip()`

- **Purpose**: Removes leading (left) whitespace (or specified characters) from each string in the Series or DataFrame column.

- **Parameters**:
  - `chars` (optional): A string specifying the set of characters to be removed. If not provided, it defaults to removing whitespace.

`str.rstrip()`

- **Purpose**: Removes trailing (right) whitespace (or specified characters) from each string in the Series or DataFrame column. 

- **Parameters**:
  - `chars` (optional): A string specifying the set of characters to be removed. If not provided, it defaults to removing whitespace.

In [195]:
# Create a DataFrame
df1 = pd.DataFrame({
    'Name': [' Alice ', ' Bob', '---Charlie']
})

# Remove leading whitespace
df1['Name_lstripped'] = df1['Name'].str.lstrip()

# Remove specific leading characters
df1['Name_custom_lstripped'] = df1['Name'].str.lstrip('-')

df1

Unnamed: 0,Name,Name_lstripped,Name_custom_lstripped
0,Alice,Alice,Alice
1,Bob,Bob,Bob
2,---Charlie,---Charlie,Charlie


## `str.replace()`

Pandas provides the `str.replace()` method to perform vectorized string replacement operations on Series or DataFrame columns of string types. 

`str.replace(pat, repl, case)`

- **Purpose**: Replaces occurrences of a specified substring or regular expression pattern in each string of the Series or DataFrame column with a new string.

- **Parameters**:
  - `pat` (required): A string or regular expression pattern that specifies the substring or pattern to be replaced.
  - `repl` (required): A string or callable that specifies the replacement for each match. If a callable is used, it is passed a regex match object and must return a replacement string to be used.
  - `n` (optional): An integer specifying the maximum number of replacements to make (for each string). If not specified, all occurrences will be replaced. Defaults to `-1`, which means replace all.
  - `case` (optional): If `True`, performs a case-sensitive replacement. Defaults to `True`.
  - `regex` (optional): If `True`, treats `pat` as a regular expression pattern. Defaults to `True`. If `False`, treats `pat` as a literal string.
  - `flags` (optional): A set of regular expression flags that are passed to the underlying `re.sub()` function. Common flags include `re.IGNORECASE`, `re.MULTILINE`, etc.

- **Usage Example**:
  ```python


In [196]:
# Create a DataFrame
df1 = pd.DataFrame({
    'Text': ['Hello, World!', 'Hello, Python!', 'Hello, Pandas!']
})

# Replace occurrences of 'Hello' with 'Hi'
df1['Text_replaced'] = df1['Text'].str.replace('Hello', 'Hi')

# Replace only the first occurrence (for each string) of 'o' with '0' 
df1['Text_replaced_first'] = df1['Text'].str.replace('o', '0', n=1)

# Replace using a regular expression pattern
# Replace any digit with a '#' symbol
df1['Text_replaced_regex'] = df1['Text'].str.replace(r'[a-f]', '#', regex=True)

df1


Unnamed: 0,Text,Text_replaced,Text_replaced_first,Text_replaced_regex
0,"Hello, World!","Hi, World!","Hell0, World!","H#llo, Worl#!"
1,"Hello, Python!","Hi, Python!","Hell0, Python!","H#llo, Python!"
2,"Hello, Pandas!","Hi, Pandas!","Hell0, Pandas!","H#llo, P#n##s!"


## `str.startswith()` and `str.endswith()`

Pandas provides the `str.startswith()` and `str.endswith()` methods to check if each string in a Series or DataFrame column starts or ends with a specified substring. These methods are useful for filtering and searching within text data.

`str.startswith(pat)`

- **Purpose**: Checks if each string in the Series or DataFrame column starts with the specified substring. It returns a boolean Series or DataFrame indicating `True` for strings that start with the given prefix and `False` otherwise.

- **Parameters**:
  - `pat` (required): A string specifying the prefix to look for at the start of each string.
  - `na` (optional): A boolean or scalar value to be returned when a string is `NaN`. Defaults to `False`.


`str.endswith(pat)`

- **Purpose**: Checks if each string in the Series or DataFrame column ends with the specified substring. It returns a boolean Series or DataFrame indicating `True` for strings that end with the given suffix and `False` otherwise.

- **Parameters**:
  - `pat` (required): A string specifying the suffix to look for at the end of each string.
  - `na` (optional): A boolean or scalar value to be returned when a string is `NaN`. Defaults to `False`.

In [197]:
# Create a DataFrame
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Albert', None]
})

# Check if each name starts with 'A'
df1['StartsWith_A'] = df1['Name'].str.startswith('A')

# Check if each name starts with 'A' and handle NaN as "False"
df1['StartsWith_A_Na_True'] = df1['Name'].str.startswith('A', na=False)

df1

Unnamed: 0,Name,StartsWith_A,StartsWith_A_Na_True
0,Alice,True,True
1,Bob,False,False
2,Charlie,False,False
3,Albert,True,True
4,,,False


## `str.len()`

Pandas provides the `str.len()` method to compute the length of each string in a Series or DataFrame column. This method returns an integer representing the number of characters in each string.

`str.len()`

- **Purpose**: Calculates the number of characters in each string of the Series or DataFrame column. This is useful for text analysis, such as determining the length of strings or filtering based on string length.

- **Parameters**:
  - There are no parameters for `str.len()`. It directly computes the length of each string.


In [198]:
# Create a DataFrame
df1 = pd.DataFrame({
    'Text': ['Alice', 'Bob', 'Charlie', 'Dave', None]
})

# Calculate the length of each string
df1['Length'] = df1['Text'].str.len()

df1

Unnamed: 0,Text,Length
0,Alice,5.0
1,Bob,3.0
2,Charlie,7.0
3,Dave,4.0
4,,


## `str.join()`

Pandas provides the `str.join()` method to join elements of a Series or DataFrame column of strings using a specified separator. This method is useful for combining text data into a single string with a chosen delimiter.

`str.join(sep)`

- **Purpose**: Joins elements of a Series or DataFrame column of strings into a single string, separated by a specified delimiter. This is useful for creating concatenated strings or combining text data.

- **Parameters**:
  - `sep` (required): The separator string to use between the elements being joined. It can be any string or character you want to use as the delimiter.


In [199]:
s = pd.Series([['lion', 'elephant', 'zebra'],
               [1.1, 2.2, 3.3],
               ['cat', None, 'dog'],
               ['cow', 4.5, 'goat'],
               ['duck', ['swan', 'fish'], 'guppy']])

# Joins elements in the same index
s.str.join('-')

0    lion-elephant-zebra
1                    NaN
2                    NaN
3                    NaN
4                    NaN
dtype: object

---
---

# Date/Time Manipulation
The following commands can used to merge and join multiple data frames together.

---
---

## `.to_datetime()`

<u> This method is specific to the PANDAS library. `.strptime()` is part of Python's standard `datetime` module. </u>

Pandas provides the `to_datetime()` function to convert a Series or DataFrame column to datetime objects.

`### ``to_datetime(arg, format)`

- **Purpose**: Converts a Series or DataFrame column of strings, integers, or other formats into datetime objects. This is useful for date-time operations and analysis.

- **Parameters**:
  - `arg` (required): The input data to convert. This can be a string, list, Series, or DataFrame.
  - `format` (optional): A string representing the format of the input data. It allows you to specify how the date-time information is formatted in the input data.
    - Default formats: 
      - 12-08-2003
      - 12-Aug-2003
  - `errors` (optional): Defines how to handle errors during conversion. Options are:
    - `'raise'`: Raise an error if conversion fails (default).
    - `'coerce'`: Coerce errors to `NaT` (Not a Time) for failed conversions.
    - `'ignore'`: Return the input unchanged if conversion fails.
  - `dayfirst` (optional): A boolean indicating whether to interpret the day first in ambiguous formats. Defaults to `False`.
  - `utc` (optional): A boolean indicating whether to convert the datetime to UTC timezone. Defaults to `False`.


In [200]:
# Create a DataFrame with date strings
df1 = pd.DataFrame({
    'Date': ['2024-08-23', '2024-08-24', '2024-08-25']
})
# Convert the 'Date' column to datetime
df1['Date'] = pd.to_datetime(df1['Date'])


# Create a DataFrame with date strings in different formats
df2 = pd.DataFrame({
    'Date': ['23/08/2024', '24/08/2024', '25/08/2024']
})
# Convert the 'Date' column to datetime with a specific format
df2['Date'] = pd.to_datetime(df2['Date'], format='%d/%m/%Y')


# Handle errors by coercing to NaT
df3 = pd.DataFrame({
    'Date': ['2024-08-23', 'invalid_date', '2024-08-25']
})
df3['Date'] = pd.to_datetime(df3['Date'], errors='coerce')

print(df1)
print("")
print(df2)
print("")
print(df3)


        Date
0 2024-08-23
1 2024-08-24
2 2024-08-25

        Date
0 2024-08-23
1 2024-08-24
2 2024-08-25

        Date
0 2024-08-23
1        NaT
2 2024-08-25


## `.strftime()`

Pandas provides the `strftime()` method to format datetime objects into strings based on a specified format.

`strftime(format)`

- **Purpose**: Converts datetime objects to strings formatted according to a specified format string. 

- **Parameters**:
  - `format` (required): A string representing the format in which to output the datetime. This string can include various format codes to specify different parts of the date and time.

- **Format Codes**:
  - **`%Y`**: Year with century (e.g., 2024)
  - **`%m`**: Month as a zero-padded decimal number (01 to 12)
  - **`%d`**: Day of the month as a zero-padded decimal number (01 to 31)
  - **`%H`**: Hour (24-hour clock) as a zero-padded decimal number (00 to 23)
  - **`%M`**: Minute as a zero-padded decimal number (00 to 59)
  - **`%S`**: Second as a zero-padded decimal number (00 to 59)

In [201]:
# Create a datetime object
dt = pd.Timestamp('2024-08-23 14:30:00')

# Format the datetime object to a string
formatted_date = dt.strftime('%d/%m/%Y %H:%M:%S')

print(formatted_date)  # Output: '23/08/2024 14:30:00'

23/08/2024 14:30:00
