In [49]:
import pandas as pd
df = pd.read_csv("titanic.csv")

---
---

# Data Exploration
The following commands can be used to explore and understand the data before applying analysis to it.

---
---

## `.head()` and `.tail()`

Pandas provides two useful methods for examining the top and bottom rows of a DataFrame: `head()` and `tail()`.

`df.head(n)` / `df.tail(n)`

- **Purpose**: `head()` returns the first `n` rows of the DataFrame; `tail()` returns the last `n` rows.
- **Parameters**:
  - `n` (optional): The number of rows to return, counted from the top for `head()` and from the bottom for `tail()`. If not specified, it defaults to 5.
- **Usage Example**:
  ```python
  df.head(10)  # Returns the first 10 rows of the DataFrame
  df.tail(10)  # Returns the last 10 rows of the DataFrame
  ```

In [50]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
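
As a minimal sketch of how `head()` and `tail()` mirror each other, the snippet below uses a small made-up DataFrame (the `toy` frame and its column names are illustrative assumptions, not part of the Titanic data).

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                    "score": [10, 20, 30, 40, 50, 60]})

first_two = toy.head(2)       # rows at positions 0 and 1
last_two = toy.tail(2)        # rows at positions 4 and 5
all_but_last = toy.head(-1)   # a negative n drops rows from the end instead

print(first_two)
print(last_two)
print(len(all_but_last))
```

Neither call modifies `toy`; each returns a new DataFrame.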


## `.sample()`

Pandas provides the `sample()` method to randomly select a sample of rows from a DataFrame.

`df.sample(n)`

- **Purpose**: Returns a random sample of `n` rows from the DataFrame. This is useful for quickly examining a subset of the data.
- **Parameters**:
  - `n` (optional): The number of rows to return. If not specified, it defaults to 1.
  - `frac` (optional): The fraction of rows to return, e.g. `frac=0.1` returns 10% of the rows. Cannot be combined with `n`.
  - `replace` (optional): Whether to allow sampling of the same row more than once (`True`) or not (`False`). The default is `False`.
  - `random_state` (optional): A seed for the random number generator for reproducibility. 
  - `axis` (optional): Axis to sample from. Defaults to 0 (rows). Set to 1 for sampling columns.
- **Usage Example**:
  ```python
  df.sample(5)  # Returns a random sample of 5 rows from the DataFrame
  df.sample(frac=0.1)  # Returns a random sample of 10% of the rows
  df.sample(n=3, random_state=42)  # Returns 3 random rows, with a fixed seed for reproducibility
  ```


In [51]:
df.sample(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
464,465,0,3,"Maisner, Mr. Simon",male,,0,0,A/S 2816,8.05,,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
820,821,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gr...",female,52.0,1,1,12749,93.5,B69,S


In [52]:
df.sample(frac = 0.01)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
190,191,1,2,"Pinsky, Mrs. (Rosa)",female,32.0,0,0,234604,13.0,,S
729,730,0,3,"Ilmakangas, Miss. Pieta Sofia",female,25.0,1,0,STON/O2. 3101271,7.925,,S
601,602,0,3,"Slabenoff, Mr. Petco",male,,0,0,349214,7.8958,,S
304,305,0,3,"Williams, Mr. Howard Hugh ""Harry""",male,,0,0,A/5 2466,8.05,,S
747,748,1,2,"Sinkkonen, Miss. Anna",female,30.0,0,0,250648,13.0,,S
350,351,0,3,"Odahl, Mr. Nils Martin",male,23.0,0,0,7267,9.225,,S
798,799,0,3,"Ibrahim Shawah, Mr. Yousseff",male,30.0,0,0,2685,7.2292,,C
44,45,1,3,"Devaney, Miss. Margaret Delia",female,19.0,0,0,330958,7.8792,,Q
318,319,1,1,"Wick, Miss. Mary Natalie",female,31.0,0,2,36928,164.8667,C7,S
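
The effect of `random_state`, `frac` and `replace` can be checked on a small frame. This is a sketch on made-up data (the `toy` frame is an assumption for illustration), not on the Titanic file.

```python
import pandas as pd

# Hypothetical toy data with ten rows.
toy = pd.DataFrame({"id": range(1, 11)})

# Same seed, same rows: useful when a result has to be repeatable.
a = toy.sample(n=3, random_state=42)
b = toy.sample(n=3, random_state=42)
print(a.equals(b))

# frac=0.5 on 10 rows returns 5 rows.
half = toy.sample(frac=0.5, random_state=0)
print(len(half))

# With replace=True the sample may be larger than the DataFrame,
# because the same row can be drawn more than once.
boot = toy.sample(n=15, replace=True, random_state=0)
print(len(boot))
```

Without `replace=True`, asking for more rows than the DataFrame holds raises a `ValueError`.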


## `.shape`

Pandas provides the `shape` attribute to quickly check the dimensionality of a DataFrame.

`df.shape`

- **Purpose**: Returns a tuple representing the dimensions of the DataFrame, (no. rows, no. columns).
- **Parameters**: 
  - None. `shape` is an attribute, not a method, so it doesn't take any arguments.
- **Usage Example**:
  ```python
  df.shape  # Returns a tuple (number_of_rows, number_of_columns)
  ```


In [53]:
df.shape # 891 rows, 12 columns

(891, 12)
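
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a made-up frame (`toy` is an illustrative assumption) and also shows two related quantities, `len(df)` and `df.size`.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape   # tuple unpacking
print(n_rows, n_cols)

print(len(toy))    # number of rows only
print(toy.size)    # total number of cells (rows * columns)
```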

## `.columns`

Pandas provides the `columns` attribute to get the column labels of a DataFrame.

`df.columns`

- **Purpose**: Returns an index object containing the column labels of the DataFrame. 
- **Parameters**: 
  - None. `columns` is an attribute, not a method, so it doesn't take any arguments.
- **Usage Example**:
  ```python
  df.columns  # Returns an index of the column labels
  ```


In [54]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## `.info()`

Pandas provides the `info()` method to quickly get a summary of the DataFrame.

`df.info()`

- **Purpose**: Provides a concise summary of the DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.
- **Parameters**:
  - `verbose` (optional): Whether to print the full summary (`True`) or a truncated one (`False`). Defaults to `None`, which automatically decides based on the number of columns.
  - `max_cols` (optional): Specifies the maximum number of columns to display. Defaults to `None`.
- **Usage Example**:
  ```python
  df.info()  # Prints a concise summary of the DataFrame
  ```


In [55]:
df.info() # Age, Cabin and Embarked contain null values (their non-null counts are below 891)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## `.describe()`

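Pandas provides the `describe()` method to generate descriptive statistics for the DataFrame. By default it summarises <u>numerical columns only</u>; missing values are skipped when the statistics are computed, so a column with nulls is still included (its `count` is simply lower).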

`df.describe()`

- **Purpose**: Generates a summary of statistics for numerical columns in the DataFrame, including count, mean, standard deviation, min, max, and the quartile values (25%, 50%, and 75%). 
- **Parameters**:
  - `percentiles` (optional): A list of percentiles to include in the output. Defaults to `[0.25, 0.5, 0.75]`.
  - `include` (optional): Specifies the data types to include in the summary. Can be `None` (default, numerical columns only), `'all'`, or a list of data types.
  - `exclude` (optional): Specifies the data types to exclude from the summary.
  - `datetime_is_numeric` (optional): Whether to treat datetime data as numeric when calculating statistics. Defaults to `False`. This parameter exists only in pandas 1.x; it was removed in pandas 2.0, where datetime columns are always treated as numeric.
- **Usage Example**:
  ```python
  df.describe()  # Returns descriptive statistics for numerical columns (missing values are skipped)
  df.describe(include='all')  # Returns descriptive statistics for all columns
  ```


In [56]:
df.describe() # Numerical columns only; nulls are skipped (Age has a count of 714)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [57]:
df.describe(include = 'all') # All columns, numerical and non-numerical

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,
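
The `Age` column above reports a count of 714 rather than 891, which suggests that `describe()` skips missing values instead of rejecting the column. A small sketch on made-up data (the `toy` frame and its columns are assumptions for illustration) makes that behaviour explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "city": ["S", "C", "S", None],
})

numeric_summary = toy.describe()             # only the numeric column "age"
full_summary = toy.describe(include="all")   # adds unique/top/freq for "city"

print(numeric_summary.loc["count", "age"])   # non-null values only
print(numeric_summary.loc["mean", "age"])    # mean of 20, 30 and 40
print(full_summary.loc["top", "city"])       # most frequent value
```

In the `include="all"` result, statistics that do not apply to a column (for example `mean` for text) are shown as `NaN`.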


## `.unique()`

Pandas provides the `unique()` method to find the unique values in a Series or DataFrame column.

`df['column'].unique()`

- **Purpose**: Returns an array of the unique values in a Series or DataFrame column.
- **Parameters**:
  - None.
- **Usage Example**:
  ```python
  df['column'].unique()  # Returns an array of unique values in the specified column
  ```


In [58]:
# Two ways to access a column. Both return the same result (attribute access is shorter, but only works when the column name is a valid Python identifier)
df['Embarked'].unique()
df.Embarked.unique()


array(['S', 'C', 'Q', nan], dtype=object)
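
Note that `nan` appears in the array above. The sketch below, on a made-up Series (`ports` is an illustrative assumption), contrasts `unique()` with the related `nunique()`, which counts distinct values and ignores missing ones by default.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with repeated values and one missing value.
ports = pd.Series(["S", "C", "S", np.nan, "Q", "C"], name="Embarked")

values = ports.unique()                    # order of first appearance; NaN is kept
n_non_null = ports.nunique()               # NaN is excluded by default
n_with_nan = ports.nunique(dropna=False)   # NaN counted as its own value

print(values)
print(n_non_null, n_with_nan)
```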

## `.value_counts()`

Pandas provides the `value_counts()` method to count the number of occurrences of each unique value in a Series or DataFrame column.

`df['column'].value_counts()`

- **Purpose**: Returns a Series containing counts of unique values in descending order. 
- **Parameters**:
  - `dropna` (optional): If `True` (default), missing values (`NaN`) are excluded. If `False`, missing values are included in the counts.
  - `normalize` (optional): If `True`, the method returns the relative frequencies of the unique values instead of the raw counts (proportion/%). Defaults to `False`.
  - `sort` (optional): If `True` (default), the counts are sorted in descending order. If `False`, the counts are returned in the order they appear.
  - `ascending` (optional): If `True`, the counts are sorted in ascending order. Defaults to `False`.

- **Usage Example**:
  ```python
  df['column'].value_counts()  # Returns a Series with counts of unique values in descending order
  df['column'].value_counts(normalize=True)  # Returns the relative frequencies of unique values
  ```


In [59]:
df.Pclass.value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [60]:
df.Pclass.value_counts(normalize=True)

Pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64
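
The `dropna` parameter matters when a column has gaps. A minimal sketch on a made-up Series (`ports` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical Series: three "S", two "C" and one missing value.
ports = pd.Series(["S", "C", "S", None, "S", "C"])

counts = ports.value_counts()                   # missing value is ignored
counts_with_nan = ports.value_counts(dropna=False)  # missing value gets its own row
shares = ports.value_counts(normalize=True)     # proportions of the non-null values

print(counts)
print(counts_with_nan)
print(shares)
```

With the default `dropna=True`, the proportions from `normalize=True` are computed over the non-null values only, so they still sum to 1.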

## `.isin()`

Pandas provides the `isin()` method to check if each element in a DataFrame or Series is contained in a specified set of values.

`df['column'].isin(values)`

- **Purpose**: Returns a boolean Series indicating whether each element in the Series or DataFrame column is contained in the given list, set, or other iterable of values.
- **Parameters**:
  - `values`: A <u>list, set, or array-like object</u> containing the values to check against. It can also be a Series or DataFrame.
- **Usage Example**:
  ```python
  df['column'].isin([value1, value2, value3])  # Returns a boolean Series indicating membership
  ```


In [61]:
df.Sex.isin(['male'])

0       True
1      False
2      False
3      False
4       True
       ...  
886     True
887    False
888    False
889     True
890     True
Name: Sex, Length: 891, dtype: bool
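
A boolean Series like the one above is most often used as a row filter. The following sketch uses a made-up frame (`toy` and its columns are illustrative assumptions) to show the mask being applied and negated.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "port": ["S", "C", "Q", "S"],
})

mask = toy["port"].isin(["C", "Q"])   # True where port is C or Q

selected = toy[mask]    # rows where the mask is True
others = toy[~mask]     # ~ inverts the mask: rows where port is neither C nor Q

print(selected)
print(others)
```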

---
---

# Missing Data
The following commands can be used to identify and deal with missing data.

---
---

## `.isna()`

Pandas provides the `isna()` method to detect missing values (NaN) in a DataFrame or Series.

`df.isna()`

- **Purpose**: Returns a DataFrame or Series of the same shape as the original, with boolean values indicating where each element is missing (NaN). 
- **Parameters**:
  - None. `isna()` is a method that does not take any arguments.
- **Usage Example**:
  ```python
  df.isna()  # Returns a DataFrame or Series of boolean values indicating missing data
  ```


In [62]:
example = pd.DataFrame({
    "Cabin": df.Cabin,
    "isna()": df.Cabin.isna()
    })

example

Unnamed: 0,Cabin,isna()
0,,True
1,C85,False
2,,True
3,C123,False
4,,True
...,...,...
886,,True
887,B42,False
888,,True
889,C148,False
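
Since `True` counts as 1 when summed, the boolean result of `isna()` can be aggregated into per-column totals. A sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "cabin": [None, "C85", None, "C123"],
})

missing_per_column = toy.isna().sum()    # number of missing values per column
missing_share = toy.isna().mean()        # fraction of missing values per column
rows_with_missing = toy[toy.isna().any(axis=1)]  # rows with at least one gap

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```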


## `.dropna()`

Pandas provides the `dropna()` method to remove missing values (NaN) from a DataFrame or Series.

`df.dropna()`

- **Purpose**: Removes rows or columns with missing values from the DataFrame or Series. By default drops any rows with NA values.
- **Parameters**:
  - `axis` (optional): Specifies whether to drop rows (`axis=0`, default) or columns (`axis=1`).
  - `how` (optional): Determines which rows or columns to drop based on missing values. 
    - `'any'`: Drop if any missing values are present.
    - `'all'`: Drop if all values are missing.
  - `thresh` (optional): An integer value specifying the minimum number of non-NA values required to keep the row or column.
  - `subset` (optional): Specifies a subset of columns or rows to consider for missing values.
  - `inplace` (optional): If `True`, performs the operation in-place without returning a new DataFrame. Defaults to `False`.
- **Usage Example**:
  ```python
  df.dropna()  # Removes rows with any missing values
  df.dropna(axis=1)  # Removes columns with any missing values
  df.dropna(how='all')  # Removes rows where all values are missing
  df.dropna(thresh=2)  # Removes rows with fewer than 2 non-null values
  df.dropna(subset=['A', 'B'])  # Removes rows where any of the specified columns have missing values
  df.dropna(inplace=True)  # Removes rows with any missing values and modifies the original DataFrame
  ```


In [72]:
df.dropna(subset = ["Age"], inplace = True) # Remove rows where "Age" has NA values  
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
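
To see how `how`, `thresh` and `subset` differ, it helps to run them on a frame small enough to check by eye. The data below is made up for illustration (the `toy` frame is an assumption).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data. Non-null values per row: 3, 1, 0, 2.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

any_missing = toy.dropna()               # keeps only fully populated rows
all_missing = toy.dropna(how="all")      # drops only the completely empty row
at_least_two = toy.dropna(thresh=2)      # keeps rows with 2 or more non-null values
subset_c = toy.dropna(subset=["c"])      # only looks at column "c"

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(subset_c.index.tolist())
```

None of these calls changes `toy`, because `inplace` defaults to `False`.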


## `.fillna()`

Pandas provides the `fillna()` method to fill missing values (NaN) in a DataFrame or Series with specified values.

`df.fillna(value=None)`

- **Purpose**: Replaces missing values with a specified value, method, or forward/backward fill. This is useful for handling missing data by imputing it with specific values or by propagating existing values.
- **Parameters**:
  - `value` (optional): The value(s) to use for filling missing values. It can be a scalar, dictionary, Series, or DataFrame.
  - `method` (optional): The method to use for filling missing values (deprecated since pandas 2.1; `df.ffill()` and `df.bfill()` are the direct replacements):
    - `'ffill'` or `'pad'`: Forward fill. Propagates the last valid value forward.
    - `'bfill'` or `'backfill'`: Backward fill. Propagates the next valid value backward.
  - `axis` (optional): Specifies the axis to fill (0 for rows, 1 for columns). Default is `None`.
  - `inplace` (optional): If `True`, modifies the original DataFrame or Series in place. Defaults to `False`.
  - `limit` (optional): The maximum number of missing values to fill. Default is `None`, which means no limit.
- **Usage Example**:
  ```python
  df.fillna(value=0)  # Replaces all missing values with 0
  df.fillna(value={'A': 0, 'B': 1})  # Fills missing values with 0 in column 'A' and 1 in column 'B'
  ```


In [81]:
# Fills missing values with "No Cabin Identified" in column "Cabin" and "Not Recorded" in column "Embarked"
df.fillna(value = {"Cabin": "No Cabin Identified", "Embarked": "Not Recorded"}, inplace = True) 
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
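
Beyond fixed placeholder strings, a common pattern is to fill a numeric column with a statistic such as its mean, or to carry the previous value forward. The sketch below assumes a made-up frame (`toy` and its columns are illustrative, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "cabin": ["C85", None, None],
})

# A dictionary gives each column its own fill value.
filled = toy.fillna({"age": toy["age"].mean(), "cabin": "Unknown"})

# Forward fill: propagate the last valid value downwards.
carried = toy["age"].ffill()

print(filled)
print(carried)
```

Whether the mean is a sensible fill value depends on the analysis; it is used here only to show the mechanics.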


---
---

# Selecting Data
The following commands can be used to select data.

---
---

## `.iloc[]`

Pandas provides the `iloc[]` indexer for integer-location based selection, i.e. selection by position in a DataFrame or Series. It is used with square brackets rather than called like a method.

`df.iloc[rows, columns]`

- **Purpose**: Allows for selection and filtering of rows and columns based on their integer position. 
- **Parameters**:
  - `rows`: The integer index or slice for selecting rows.
  - `columns`: The integer index or slice for selecting columns. Can be omitted if selecting only rows.
- **Usage Example**:
  ```python
  df.iloc[0]  # Returns the first row as a Series
  df.iloc[[0]]  # Returns the first row as a DataFrame
  df.iloc[:, 1]  # Returns the second column as a Series
  df.iloc[:, [1]]  # Returns the second column as a DataFrame
  df.iloc[0:3, 1:3]  # Returns rows 0 to 2 and columns 1 to 2 (the upper bound is not inclusive)
  df.iloc[[0, 2], [1, 2]]  # Returns the rows at positions 0 and 2 and the columns at positions 1 and 2
  ```


In [89]:
# Return the first 10 rows, showing columns 0 (PassengerId), 3 (Name) and 6 (SibSp)
df.iloc[0:10, [0,3,6]] 

Unnamed: 0,PassengerId,Name,SibSp
0,1,"Braund, Mr. Owen Harris",1
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,3,"Heikkinen, Miss. Laina",0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,5,"Allen, Mr. William Henry",0
6,7,"McCarthy, Mr. Timothy J",0
7,8,"Palsson, Master. Gosta Leonard",3
8,9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",0
9,10,"Nasser, Mrs. Nicholas (Adele Achem)",1
10,11,"Sandstrom, Miss. Marguerite Rut",1
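
In the output above the index jumps from 4 to 6: the row labelled 5 was removed earlier by `dropna`, yet `iloc[0:10]` still returns ten rows because it counts positions, not labels. A minimal sketch on made-up data (the `toy` frame is an assumption) shows the same effect.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; row 1 has a missing age.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "age": [22.0, np.nan, 26.0, 35.0],
})

kept = toy.dropna(subset=["age"])   # index labels are now 0, 2, 3

second_row = kept.iloc[1]     # position 1, which carries the label 2
as_series = kept.iloc[:, 1]   # a single integer selects a Series
as_frame = kept.iloc[:, [1]]  # a list of integers selects a DataFrame

print(kept.index.tolist())
print(second_row["name"])
print(type(as_series).__name__, type(as_frame).__name__)
```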


## `.loc[]`

Pandas provides the `loc[]` indexer for label-based selection of rows and columns. Like `iloc[]`, it is used with square brackets rather than called like a method.

`df.loc[rows, columns]`

- **Purpose**: Allows for selection and filtering of rows and columns based on their labels. 
- **Parameters**:
  - `rows`: The label(s) for selecting rows. Can be a single label, a list of labels, or a slice.
  - `columns`: The label(s) for selecting columns. Can be a single label, a list of labels, or a slice. Can be omitted if selecting only rows.
- **Usage Example**:
  ```python
  df.loc['row_label']  # Returns a Series with the data for the specified row label (the default integer index, unless `set_index` is used)
  df.loc[['row_label']]  # Returns a DataFrame with the data for the specified row label
  df.loc[:, 'column_label']  # Returns a Series with the data for the specified column label
  df.loc[:, ['column_label']]  # Returns a DataFrame with the data for the specified column label
  df.loc['row1':'row3', 'col1':'col3']  # Returns rows 'row1' to 'row3' and columns 'col1' to 'col3' (both ends of a label slice are inclusive)
  df.loc[['row1', 'row3'], ['col2', 'col3']]  # Returns the data for the specified rows and columns
  ```


In [92]:
# Returns the rows labelled 0 to 3 (both ends inclusive) with columns PassengerId, Name and Age (in that order)
df.loc[0:3, ["PassengerId", "Name", "Age"]]

Unnamed: 0,PassengerId,Name,Age
0,1,"Braund, Mr. Owen Harris",22.0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,3,"Heikkinen, Miss. Laina",26.0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
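
The slice `0:3` above returns four rows because label slices in `loc[]` include both endpoints, unlike positional slices in `iloc[]`. The sketch below, on made-up data (the `toy` frame is an assumption), contrasts the two and also shows `loc[]` with a boolean condition.

```python
import pandas as pd

# Hypothetical toy data with the default integer index.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "age": [22, 38, 26, 35],
})

inclusive = toy.loc[0:2, ["name", "age"]]   # labels 0, 1 and 2
exclusive = toy.iloc[0:2]                   # positions 0 and 1 only

# A boolean Series can be used in place of row labels.
over_30 = toy.loc[toy["age"] > 30, "name"]

print(len(inclusive), len(exclusive))
print(over_30.tolist())
```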


---
---

# Data Transformation
The following commands can be used to transform data, such as replacing values, adding new values or removing values.

---
---

## `.apply()`

Pandas provides the `apply()` method to apply a function along an axis of the DataFrame or Series. This is useful for performing operations or transformations on data.

`df.apply(func, axis=0)`

- **Purpose**: Applies a function along the specified axis of the DataFrame or Series. This can be used to perform complex calculations or transformations on your data.
- **Parameters**:
  - `func`: The function to apply. This can be a user-defined function, a lambda function, or a built-in function.
  - `axis` (optional): The axis along which the function is applied. With `axis=0` (default) the function receives each column in turn; with `axis=1` it receives each row.
  - `result_type` (optional): Determines the format of the result (only applicable when `axis=1`). Options are `'expand'`, `'reduce'`, or `'broadcast'`.
- **Usage Example**:
  ```python
  df.apply(lambda x: x + 1)  # Adds 1 to every element in the DataFrame
  df.apply(lambda x: x.mean(), axis=0)  # Computes the mean of each column
  df.apply(lambda x: x.max() - x.min(), axis=1)  # Computes the range (max - min) for each row
  ```



In [99]:
# Add 1 to every value in the column "Age" and "Fare" 
df[['Age', 'Fare']].apply(lambda x: x+1)

Unnamed: 0,Age,Fare
0,23.0,8.2500
1,39.0,72.2833
2,27.0,8.9250
3,36.0,54.1000
4,36.0,9.0500
...,...,...
885,40.0,30.1250
886,28.0,14.0000
887,20.0,31.0000
889,27.0,31.0000
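
The `axis` argument decides what the function receives: a whole column (`axis=0`) or a whole row (`axis=1`). A minimal sketch on made-up numeric data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the lambda is called once per column.
col_ranges = toy.apply(lambda col: col.max() - col.min())

# axis=1: the lambda is called once per row.
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_ranges)   # indexed by column name
print(row_sums)     # indexed by row label
```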


In the example below, we create a new column that classifies each passenger as Young, Middle-Aged, or Old based on their age.

In [105]:
def replacement(age):
    # Map a single age value to a label
    if age < 25:
        return "Young"
    elif 25 <= age < 40:
        return "Middle-Aged"
    else:
        return "Old"

df['Age Classification'] = df["Age"].apply(replacement)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Young
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Middle-Aged
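
For binning a numeric column, `pd.cut` is a possible alternative to a hand-written function passed to `apply()`. The sketch below assumes the three intended labels are Young (under 25), Middle-Aged (25 to under 40) and Old (40 and over); the ages are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the two boundary values and a missing value.
ages = pd.Series([22.0, 25.0, 39.0, 40.0, 54.0, np.nan])

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 25), [25, 40), [40, inf)
labels = pd.cut(
    ages,
    bins=[0, 25, 40, np.inf],
    right=False,
    labels=["Young", "Middle-Aged", "Old"],
)

print(labels)
```

One difference worth noting: `pd.cut` leaves a missing age as `NaN`, whereas an `if`/`elif`/`else` function sends it to the `else` branch because every comparison with `NaN` is `False`.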


## `.replace()`

Pandas provides the `replace()` method to replace specified values with other values within a DataFrame or Series.

`df.replace(to_replace, value=None, inplace=False, limit=None, regex=False)`

- **Purpose**: Replaces occurrences of `to_replace` with `value`. This can be used to clean or transform data by substituting specific values.
- **Parameters**:
  - `to_replace`: The value or pattern to be replaced. This can be a scalar, list, dictionary, or regex pattern.
  - `value` (optional): The value to replace `to_replace` with. Can be a scalar, list, or dictionary. It can be omitted when `to_replace` is itself a dictionary that maps old values to new ones.
  - `inplace` (optional): If `True`, performs the operation in place and modifies the DataFrame/Series directly. Defaults to `False`.
  - `limit` (optional): The maximum size gap to forward or backward fill. It does not cap the number of ordinary replacements, and it is deprecated since pandas 2.1.
  - `regex` (optional): If `True`, treats `to_replace` as a regex pattern. Defaults to `False`.
- **Usage Example**:
  ```python
  df.replace(2, 10)  # Replaces all occurrences of 2 with 10 in the DataFrame
  df["column_name"].replace(replacement_dictionary)  # Replace values in column "column_name" based on a dictionary 

  df["Age"].replace(50, "Mid-Life") # Replace 50 with "Mid-Life" in the "Age" column only 
  df.replace({'Age': 50}, {'Age': "Mid-Life"})  # Replace 50 with "Mid-Life" in the "Age" column only
  ```


In [112]:
# Define a replacement dictionary 
replacement_dictionary = {
    1: "First Class",
    2: "Second Class",
    3: "Third Class"
}

# Replace the 1, 2, 3 in "Pclass" with First Class, Second Class and Third Class
df.Pclass = df.Pclass.replace(replacement_dictionary)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Classification
0,1,0,Third Class,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,No Cabin Identified,S,Young
1,2,1,First Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Middle-Aged
2,3,1,Third Class,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,No Cabin Identified,S,Middle-Aged
3,4,1,First Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Middle-Aged
4,5,0,Third Class,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,No Cabin Identified,S,Middle-Aged
...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,Third Class,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,No Cabin Identified,Q,Middle-Aged
886,887,0,Second Class,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,No Cabin Identified,S,Middle-Aged
887,888,1,First Class,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Young
889,890,1,First Class,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Middle-Aged
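
A dictionary passed to `replace()` only touches the values it lists; anything else is left as it was. The sketch below, on a made-up Series (`pclass` and the deliberately incomplete `labels` dictionary are assumptions), contrasts this with `Series.map()`, which turns unlisted values into `NaN`.

```python
import pandas as pd

# Hypothetical class codes; the dictionary deliberately omits 3.
pclass = pd.Series([3, 1, 2, 3])
labels = {1: "First Class", 2: "Second Class"}

replaced = pclass.replace(labels)   # 3 is not in the dictionary, so it stays 3
mapped = pclass.map(labels)         # 3 is not in the dictionary, so it becomes NaN

print(replaced.tolist())
print(mapped.tolist())
```

Which behaviour is preferable depends on whether unlisted values should pass through silently or stand out as missing.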


## `.melt()`

Pandas provides the `melt()` method to unpivot or transform a DataFrame from a wide format to a long format.

`df.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)`

- **Purpose**: Converts columns of a DataFrame into rows. This is helpful for normalizing data or for preparing data for certain types of analyses or visualizations.
- **Parameters**:
  - `id_vars` (optional): Columns to set as identifier variables. These columns will remain unchanged in the output.
  - `value_vars` (optional): Columns to unpivot into rows. If not specified, all columns not included in `id_vars` will be used.
  - `var_name` (optional): Name to use for the new column that will contain the former column names. Defaults to `'variable'`.
  - `value_name` (optional): Name to use for the new column that will contain the values from the original columns. Defaults to `'value'`.
  - `col_level` (optional): If columns have multiple levels (MultiIndex), specifies which level to use for unpivoting.



**What is WIDE vs LONG formatted data?**
- <u> Wide data (more common) </u>: each individual occupies their own row, and each of their variables occupies a single column.
  - Considered "people friendly" as it is easy to read and interpret (all information about an individual is available at a single glance)
  - An easy way to identify wide data is that the values in the first column tend not to repeat
- <u> Long data (desired) </u>: each row holds a single observation (one identifier, one variable name, and one value), so an individual appears on several rows and the values in the first column repeat.

  


In [115]:
# Create the WIDE DataFrame
df = pd.DataFrame({
    'Date': ['2024-08-01', '2024-08-02', '2024-08-03'],
    'Product_A': [10, 20, 30],
    'Product_B': [15, 25, 35]
})
df


Unnamed: 0,Date,Product_A,Product_B
0,2024-08-01,10,15
1,2024-08-02,20,25
2,2024-08-03,30,35


Above is the original WIDE data frame. Below is the melted LONG data frame.

In [117]:
# Melt the DataFrame
df_melted = df.melt(id_vars='Date', value_vars=['Product_A', 'Product_B'], var_name='Product', value_name='Sales')
df_melted

Unnamed: 0,Date,Product,Sales
0,2024-08-01,Product_A,10
1,2024-08-02,Product_A,20
2,2024-08-03,Product_A,30
3,2024-08-01,Product_B,15
4,2024-08-02,Product_B,25
5,2024-08-03,Product_B,35
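
One way to check that no information is lost when melting is to pivot the long table back to wide form. The sketch below reuses the sales data from the example; the use of `pivot()` for the reverse step is an addition for illustration, not part of the original walkthrough.

```python
import pandas as pd

# Same wide data as in the example above.
wide = pd.DataFrame({
    "Date": ["2024-08-01", "2024-08-02", "2024-08-03"],
    "Product_A": [10, 20, 30],
    "Product_B": [15, 25, 35],
})

# Wide to long: one row per (Date, Product) pair.
long = wide.melt(id_vars="Date", var_name="Product", value_name="Sales")

# Long back to wide: former variable names become columns again.
back = long.pivot(index="Date", columns="Product", values="Sales").reset_index()

print(long.shape)
print(back)
```

Leaving out `value_vars` melts every column that is not listed in `id_vars`, which here gives the same result as naming both product columns.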
