# Pandas
1. Pandas is Python’s main ETL package for structured data.
1. Built on top of NumPy and designed to mimic the functionality of R data frames.
1. Provides a convenient way to handle tabular data.
1. Supports most common SQL-style operations, including group-by and join.
1. Compatible with many other data science packages, including visualisation packages such as Matplotlib and Seaborn.
1. Defines two main data types:
- pandas.Series
- pandas.DataFrame

# Series
### A Series is a one-dimensional, single type array.
 - Represents a column in a table.
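### It consists of two arrays:
 - Index array (a pandas `Index` holding the labels)
 - Values array (a NumPy array)
### Each element has an index label (ID).
 - Indices do not have to be sequential.
 - Like primary keys, labels are normally unique, although pandas does not enforce this.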

## A Pandas Series can be created from a Python list or a NumPy array:
```python
import pandas as pd
x = [1,3,5,7]
mySeries = pd.Series(x)
print(mySeries)
```
By default, the index starts from 0 and increases by 1 for each subsequent element in the Series.
The index is used to access the corresponding value.
```python
print(mySeries[1])
```
Output:
```
3
```


In [5]:
import pandas as pd
x = [1,3,5,7]
mySeries = pd.Series(x)
print(mySeries)
print(mySeries.index)
print(mySeries.values)


0    1
1    3
2    5
3    7
dtype: int64
RangeIndex(start=0, stop=4, step=1)
[1 3 5 7]



# Querying Series
A Series can be used like an array, except that the values passed in square brackets must match labels in the index array, not positions.
- **series.index** returns the index array.
- **series.values** returns the values array.
- **series[ind]** is equivalent to series.loc[ind] for label lookups: it returns the element in the series with ID equal to ind.
- **series.iloc[i]** returns the element at position i in the series (counting from 0).

# pandas **loc** and **iloc** compared

Note: `iloc` and `loc` are both indexing methods in pandas, but they have different purposes and syntax.

**`iloc`**:

- `iloc` is primarily used for integer-location based indexing, meaning you use integer positions to select data.
- It allows you to select rows and columns by their integer index positions.
- You can pass single integers, slices, lists, or boolean arrays.
- The syntax is `dataframe.iloc[row_index, column_index]` or `series.iloc[index]` for a pandas Series.
- For example:
```python
import pandas as pd

# Create a Series
s = pd.Series([1, 2, 3, 4, 5])

# Select the value at position 2
print(s.iloc[2])  # Output: 3
```

**`loc`**:

- `loc` is primarily label-based indexing, meaning you use the labels of rows and columns to select data.
- It allows you to select rows and columns by their labels (index and column names).
- You can pass single labels, lists of labels, slices, or boolean arrays.
- The syntax is `dataframe.loc[row_label, column_label]` or `series.loc[label]` for a pandas Series.
- For example:
```python
import pandas as pd

# Create a Series with custom index
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Select the value with index label 'c'
print(s.loc['c'])  # Output: 3
```

In summary:

- Use `iloc` when you want to select data by integer position.
- Use `loc` when you want to select data by label.
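
Both indexers also accept slices and Boolean arrays, as noted above, but they treat slices differently. The following is a minimal sketch (the Series `s` and its labels are made up for illustration): an `iloc` slice excludes its end position, whereas a `loc` slice includes its end label.

```python
import pandas as pd

# Illustrative Series with string labels
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# iloc slices by position and excludes the end position, like a Python list
by_position = s.iloc[0:2]        # positions 0 and 1

# loc slices by label and includes the end label
by_label = s.loc["a":"c"]        # labels 'a', 'b' and 'c'

# A Boolean array selects the elements where the array is True
mask = (s > 15).to_numpy()
by_mask = s.iloc[mask]

print(by_position.tolist())      # [10, 20]
print(by_label.tolist())         # [10, 20, 30]
print(by_mask.tolist())          # [20, 30, 40]
```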

# Querying Series
In [33]:
import pandas as pd
X=[1, 3, 5, 7]
mySeries=pd.Series(X,index=[9,8,7,6])
print(mySeries)



9    1
8    3
7    5
6    7
dtype: int64


In [35]:
mySeries.index

Index([9, 8, 7, 6], dtype='int64')

In [37]:
mySeries.values

array([1, 3, 5, 7], dtype=int64)

In [39]:
mySeries.loc[7]

5

In [41]:
mySeries.iloc[0]

1

# DataFrame
### A Pandas DataFrame represents a table, and it contains:
  - Data in the form of rows and columns.
  - Row IDs (the index array, i.e. primary key).
  - Column names (IDs of the columns).
### It is equivalent to a collection of Series that share one index.
  - The row indices start from 0 (default) and increase by 1.
### DataFrames are the data structures most suitable for analytics.
  - Rows represent observations.
  - Columns represent attributes, which can have different data types.
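
To make the "collection of Series" idea concrete, here is a small sketch, assuming two made-up Series named `age` and `height`. It builds a DataFrame from them and shows that selecting a column gives back a Series.

```python
import pandas as pd

# Two illustrative Series that share the default index 0, 1, 2
age = pd.Series([34, 42, 27])
height = pd.Series([1.78, 1.82, 1.75])

# Each Series becomes one column; rows are aligned on the shared index
people = pd.DataFrame({"age": age, "height": height})

print(people)
print(type(people["age"]))   # selecting one column returns a Series
print(people.dtypes)         # each column keeps its own data type
```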

# Creating DataFrames
- From Python lists or NumPy arrays:

In [50]:
data = {
    "age": [34, 42, 27],
    "height": [1.78,1.82, 1.75],
    "weight": [75,80,70]
}
df= pd.DataFrame(data)
print(df)

   age  height  weight
0   34    1.78      75
1   42    1.82      80
2   27    1.75      70


- Use a dictionary with column names as keys and, for each key, a list of that column's values.
- Creating from a CSV file:
```python
pandas.read_csv(csv_file_name)
```
  - By default, the first row is used for column names.

# Reading CSV files
`read_csv` reads a comma-delimited file into a DataFrame.
 - Can pass a path or URL to be read from.
 - Parameters control how to read.
   - e.g., whether to parse dates or not.
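
As a rough illustration of how a parameter changes what `read_csv` returns, the sketch below reads the same text twice, with and without `parse_dates`. The CSV content is hypothetical and held in memory with `io.StringIO`, so no file on disk is assumed.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real call would pass a path or URL instead
csv_text = """day,temp
2023-07-15,15.68
2023-07-16,25.16
"""

# Without parse_dates, the 'day' column is read as plain strings
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the named column is converted to datetimes while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["day"])

print(raw.dtypes)
print(parsed.dtypes)
```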


In [1]:
import pandas as pd
df = pd.read_csv("data/loan_data.csv")
df[:2]


    ID  Income        Term  Balance  Debt  Score  Default
0  567   17500  Short Term     1460   272  225.0    False
1  523   18500   Long Term      890   970  187.0    False


# Reading XML & JSON
 - `read_json` reads a JSON file into a DataFrame.
 - `read_xml` reads an XML file into a DataFrame.
 - Both accept a path or URL to read from.
 - Parameters control how the file is read.
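
`read_json` also accepts a file-like object, which is handy for trying it out without a file. The following is a minimal sketch, assuming a shortened, made-up JSON text in the same column-oriented layout as the weather file used in this section.

```python
import io
import pandas as pd

# Hypothetical JSON text: one object per column, keyed by row label
json_text = (
    '{"day": {"0": "2023-07-15", "1": "2023-07-16"},'
    ' "temp": {"0": 15.68, "1": 25.16}}'
)

# Wrap the text in StringIO so it is read like a file
weather = pd.read_json(io.StringIO(json_text))
print(weather)
```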

In [None]:
# JSON file for this example: "data/weather_data.json"
{
  "day": {
    "0": "2023-07-15", "1": "2023-07-16",
    "2": "2023-07-17", "3": "2023-07-18",
    "4": "2023-07-19", "5": "2023-07-20",
    "6": "2023-07-21", "7": "2023-07-22",
    "8": "2023-07-23", "9": "2023-07-24"
  },
  "temp": {
    "0": 15.68, "1": 25.16,
    "2": 13.26, "3": 24.63,
    "4": 12.78, "5": 23.52,
    "6": 17.8, "7": 24.98,
    "8": 23.48, "9": 23.3
  },
  "humidity": {
    "0": 73.18, "1": 83.88,
    "2": 80.05, "3": 82.37,
    "4": 83.1, "5": 85.35,
    "6": 85.64, "7": 76.81,
    "8": 80.86, "9": 79.96
  },
  "sun_hrs": {
    "0": 6.4, "1": 8.06,
    "2": 4.89, "3": 9.13,
    "4": 17.1, "5": 0.72,
    "6": 5.79, "7": 10.95,
    "8": 3.77, "9": 14.62
  }
}

In [79]:
import pandas as pd
df = pd.read_json("data/weather_data.json")
df

          day   temp  humidity  sun_hrs
0  2023-07-15  15.68     73.18     6.40
1  2023-07-16  25.16     83.88     8.06
2  2023-07-17  13.26     80.05     4.89
3  2023-07-18  24.63     82.37     9.13
4  2023-07-19  12.78     83.10    17.10
5  2023-07-20  23.52     85.35     0.72
6  2023-07-21  17.80     85.64     5.79
7  2023-07-22  24.98     76.81    10.95
8  2023-07-23  23.48     80.86     3.77
9  2023-07-24  23.30     79.96    14.62


# Querying SQL Tables
`read_sql` reads the result of a SQL query into a DataFrame.
 - Requires an appropriate database connection to be set up.
   - Including correct credentials, where the database needs them.


In [None]:
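import sqlite3
import pandas as pd

# Open a connection to the SQLite database file, then run the query
db_conn = sqlite3.connect(r"movies_db.sqlite")
df = pd.read_sql(r"SELECT * FROM movies", db_conn)
df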
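
For a version that runs without the `movies_db.sqlite` file, the sketch below builds a throwaway in-memory SQLite database first. The table layout (`title`, `year`) and its two rows are assumptions made for illustration, not the contents of the real database.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical 'movies' table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Alien", 1979), ("Heat", 1995)],
)
conn.commit()

# The query runs in the database; only its result becomes a DataFrame
movies = pd.read_sql("SELECT * FROM movies WHERE year > 1980", conn)
conn.close()

print(movies)
```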


# Changing Types
## Changing Column Types
- Ensuring data is of the correct type is important, both technically and statistically.
- The astype method can be used to do this to Series and DataFrames.


In [19]:
import pandas as pd
df=pd.read_csv("data/loan_data.csv")
df['ID'].head()


0    567
1    523
2    544
3    370
4    756
Name: ID, dtype: int64

In [85]:
df['ID'].astype("U").head()

0    567
1    523
2    544
3    370
4    756
Name: ID, dtype: object
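
`astype` is also available on a whole DataFrame, where it can take a mapping from column name to target type. The sketch below uses a tiny made-up table (`loans`) rather than the loan data file.

```python
import pandas as pd

# Illustrative data: 'Score' arrives as text, 'ID' as integers
loans = pd.DataFrame({"ID": [567, 523], "Score": ["225.0", "187.0"]})

# Series.astype converts a single column
loans["Score"] = loans["Score"].astype(float)

# DataFrame.astype converts several columns at once
converted = loans.astype({"ID": str, "Score": "int64"})

print(loans.dtypes)
print(converted.dtypes)
```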

# Note
In Python, both single quotes (`'`) and double quotes (`"`) can be used interchangeably to define string literals. So, when specifying the target dtype using the `astype()` method, you can use either single quotes or double quotes for string representations of the dtype.

For example, the following two lines are equivalent:

```python
df['A'] = df['A'].astype('U')
df['A'] = df['A'].astype("U")
```

Both lines will achieve the same result: converting the data type of column 'A' to Unicode strings.

The `head()` method in pandas is used to display the first few rows of a DataFrame. By default, it displays the first 5 rows, but you can specify the number of rows you want to display by passing an integer argument to the method.

Here's the syntax:
```python
DataFrame.head(n=5)
```

- `n`: This argument specifies the number of rows to display from the beginning of the DataFrame. By default, it is set to `5`.

For example:
```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['a', 'b', 'c', 'd', 'e'],
                   'C': [True, False, True, False, True]})

# Display the first 3 rows
print(df.head(3))
```

Output:
```
   A  B      C
0  1  a   True
1  2  b  False
2  3  c   True
```

In this example, `df.head(3)` displays the first 3 rows of the DataFrame `df`. If you omit the argument (`n`), it defaults to `5`, and `df.head()` would display the first 5 rows. This is particularly useful for quickly inspecting the structure and contents of a DataFrame without printing the entire DataFrame, especially when dealing with large datasets.

In [3]:
import pandas as pd
df = pd.read_json("data/weather_data.json")
df['day']


0    2023-07-15
1    2023-07-16
2    2023-07-17
3    2023-07-18
4    2023-07-19
5    2023-07-20
6    2023-07-21
7    2023-07-22
8    2023-07-23
9    2023-07-24
Name: day, dtype: object

In [5]:
import pandas as pd
df = pd.read_json("data/weather_data.json")
pd.to_datetime(df['day'], format="%Y-%m-%d")


0   2023-07-15
1   2023-07-16
2   2023-07-17
3   2023-07-18
4   2023-07-19
5   2023-07-20
6   2023-07-21
7   2023-07-22
8   2023-07-23
9   2023-07-24
Name: day, dtype: datetime64[ns]

## Aside: Date Formatter Strings
The `pd.to_datetime()` function in Pandas allows you to convert a column of strings representing dates into actual datetime objects. The `format` parameter specifies the format of the input strings. Here are several common and valid date format strings that can be used with `pd.to_datetime()`:

### Common Date Format Strings

1. **Year-Month-Day (`"%Y-%m-%d"`)**:
    ```python
    pd.to_datetime(df['day'], format="%Y-%m-%d")
    ```
    - Example: "2024-06-18"

2. **Day-Month-Year (`"%d-%m-%Y"`)**:
    ```python
    pd.to_datetime(df['day'], format="%d-%m-%Y")
    ```
    - Example: "18-06-2024"

3. **Month-Day-Year (`"%m-%d-%Y"`)**:
    ```python
    pd.to_datetime(df['day'], format="%m-%d-%Y")
    ```
    - Example: "06-18-2024"

4. **Year/Month/Day (`"%Y/%m/%d"`)**:
    ```python
    pd.to_datetime(df['day'], format="%Y/%m/%d")
    ```
    - Example: "2024/06/18"

5. **Day/Month/Year (`"%d/%m/%Y"`)**:
    ```python
    pd.to_datetime(df['day'], format="%d/%m/%Y")
    ```
    - Example: "18/06/2024"

6. **Month/Day/Year (`"%m/%d/%Y"`)**:
    ```python
    pd.to_datetime(df['day'], format="%m/%d/%Y")
    ```
    - Example: "06/18/2024"

7. **Year-Month-Day Hour:Minute:Second (`"%Y-%m-%d %H:%M:%S"`)**:
    ```python
    pd.to_datetime(df['day'], format="%Y-%m-%d %H:%M:%S")
    ```
    - Example: "2024-06-18 15:30:45"

8. **Day-Month-Year Hour:Minute:Second (`"%d-%m-%Y %H:%M:%S"`)**:
    ```python
    pd.to_datetime(df['day'], format="%d-%m-%Y %H:%M:%S")
    ```
    - Example: "18-06-2024 15:30:45"

9. **Month-Day-Year Hour:Minute:Second (`"%m-%d-%Y %H:%M:%S"`)**:
    ```python
    pd.to_datetime(df['day'], format="%m-%d-%Y %H:%M:%S")
    ```
    - Example: "06-18-2024 15:30:45"

10. **Full Date and Time with Day Name (`"%A, %d %B %Y %H:%M:%S"`)**:
    ```python
    pd.to_datetime(df['day'], format="%A, %d %B %Y %H:%M:%S")
    ```
    - Example: "Tuesday, 18 June 2024 15:30:45"

11. **Year-Month-Day with T Separator (`"%Y-%m-%dT%H:%M:%S"`)**:
    ```python
    pd.to_datetime(df['day'], format="%Y-%m-%dT%H:%M:%S")
    ```
    - Example: "2024-06-18T15:30:45"

12. **Custom Format with Microseconds (`"%Y-%m-%d %H:%M:%S.%f"`)**:
    ```python
    pd.to_datetime(df['day'], format="%Y-%m-%d %H:%M:%S.%f")
    ```
    - Example: "2024-06-18 15:30:45.123456"

### Format Specifiers

Here are some common format specifiers you can use to construct these strings:

- `%Y`: Four-digit year
- `%y`: Two-digit year
- `%m`: Two-digit month (01 to 12)
- `%d`: Two-digit day of the month (01 to 31)
- `%H`: Two-digit hour (00 to 23)
- `%M`: Two-digit minute (00 to 59)
- `%S`: Two-digit second (00 to 59)
- `%f`: Microseconds (000000 to 999999)
- `%A`: Full weekday name
- `%a`: Abbreviated weekday name
- `%B`: Full month name
- `%b`: Abbreviated month name
- `%p`: AM or PM (used together with `%I`, the hour on a 12-hour clock)

### Example Usage

```python
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    'day': ['2024-06-18', '2024/06/18', '18-06-2024', '06-18-2024']
})

# Convert using different formats
df['datetime_ymd'] = pd.to_datetime(df['day'], format="%Y-%m-%d", errors='coerce')
df['datetime_dmy'] = pd.to_datetime(df['day'], format="%d-%m-%Y", errors='coerce')
df['datetime_mdy'] = pd.to_datetime(df['day'], format="%m-%d-%Y", errors='coerce')

print(df)
```

### Handling Errors

- **`errors='coerce'`**: This parameter ensures that if a date does not match the format, it will be converted to `NaT` (Not a Time) instead of raising an error. This is useful when dealing with mixed or inconsistent date formats.

By using the appropriate format strings and handling parameters, you can effectively convert and manipulate date data in Pandas.

# Indexing DataFrames
Getting entire columns:
```python
my_dataframe[column_name]
```

In [96]:
import pandas as pd
df = pd.read_json("data/weather_data.json")
df['temp']
df[['temp','humidity']]

    temp  humidity
0  15.68     73.18
1  25.16     83.88
2  13.26     80.05
3  24.63     82.37
4  12.78     83.10
5  23.52     85.35
6  17.80     85.64
7  24.98     76.81
8  23.48     80.86
9  23.30     79.96


In [100]:
df['temp']

0    15.68
1    25.16
2    13.26
3    24.63
4    12.78
5    23.52
6    17.80
7    24.98
8    23.48
9    23.30
Name: temp, dtype: float64

# Row Retrieval
## Getting Entire Rows
```python
my_dataframe.loc[row_id]
```

In [7]:
df.loc[0] # Row with Index 0

day         2023-07-15
temp             15.68
humidity         73.18
sun_hrs            6.4
Name: 0, dtype: object

In [113]:
df.loc[[0,1]] # Rows with indices 0 & 1

          day   temp  humidity  sun_hrs
0  2023-07-15  15.68     73.18     6.40
1  2023-07-16  25.16     83.88     8.06


## Named row retrieval
### Indices can be named: row labels can be strings instead of integers

In [142]:
import pandas as pd
import numpy as np

# Sample data
sample_data = {
    'age': [25, 30, 35],
    'weight': [70, 75, 80],
    'height': [170, 175, 180]
}

# Index list
index_list = ["ind1", "ind2", "ind3"]

# Create DataFrame with sample data and specified indices
df = pd.DataFrame(sample_data, index=index_list)

# Print the DataFrame
print(df)


      age  weight  height
ind1   25      70     170
ind2   30      75     175
ind3   35      80     180


In [137]:
df.loc["ind2"]  # row with index "ind2"

age        30
weight     75
height    175
Name: ind2, dtype: int64

In [133]:
df.iloc[2] # row with position 2

age        35
weight     80
height    180
Name: ind3, dtype: int64

# Slicing DataFrames
## Getting entire columns:
```python
my_dataframe.loc[:, col_name]
my_dataframe.iloc[:, col_position]
```

In [144]:
df.loc[:, "age"]

ind1    25
ind2    30
ind3    35
Name: age, dtype: int64

In [146]:
df.iloc[:,0]

ind1    25
ind2    30
ind3    35
Name: age, dtype: int64

# Getting individual elements from row and column IDs:
```python
my_dataframe.loc[row_id, col_name]
my_dataframe.iloc[i,j]
```

In [148]:
df.loc["ind1", "height"]

170

In [150]:
df.iloc[0,1]

70

# DF Slicing Summary
```python
my_dataframe.loc[[id1, id2, id3], :]
```
returns rows id1, id2 and id3, all columns
```python
my_dataframe.loc[:, [col1, col2, col3]]
```
returns columns col1, col2 and col3, all rows
```python
my_dataframe.loc[[id1, id2, id3], [col1, col2, col3]]
```
returns 3 by 3 table of rows id1, id2 and id3, columns col1, col2, and col3
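
The three patterns above can be tried on a small table. This is a sketch only; the DataFrame repeats the illustrative `ind1` to `ind3` data from the named-row example.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 30, 35], "weight": [70, 75, 80], "height": [170, 175, 180]},
    index=["ind1", "ind2", "ind3"],
)

rows = df.loc[["ind1", "ind3"], :]                    # chosen rows, all columns
cols = df.loc[:, ["age", "height"]]                   # all rows, chosen columns
block = df.loc[["ind1", "ind3"], ["age", "height"]]   # chosen rows and columns

# The same block selected by position instead of by label
same_block = df.iloc[[0, 2], [0, 2]]

print(block)
```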


# Pandas DF Broadcasting Operations
- Like NumPy, Pandas broadcasts operations.
  - i.e., we can perform calculations with columns like we do with single values.
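
Broadcasting applies between two columns as well as between a column and a single value. A minimal sketch with made-up height and weight values:

```python
import pandas as pd

df = pd.DataFrame({"height": [1.78, 1.82, 1.75], "weight": [75, 80, 70]})

# Column with a single value: every height is multiplied by 100
df["height_cm"] = df["height"] * 100

# Column with column: the calculation is applied row by row
df["bmi"] = df["weight"] / df["height"] ** 2

print(df.round(2))
```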

In [171]:
import pandas as pd
df = pd.read_csv("data/mortgage_applicants.csv")
df['Income'].head()

0    17626
1    18959
2    20560
3    21894
4    24430
Name: Income, dtype: int64

In [173]:
(df['Income']/12).head()

0    1468.833333
1    1579.916667
2    1713.333333
3    1824.500000
4    2035.833333
Name: Income, dtype: float64

# Boolean Operators
Symbolic Boolean operators (`&` for AND, `|` for OR, `~` for NOT) can be used to combine conditions element-wise. Each condition must be wrapped in parentheses.

In [179]:
(df["Income"] > 20000) & (df["Debt"] < 2000)

0      False
1      False
2       True
3       True
4       True
       ...  
851    False
852     True
853    False
854    False
855    False
Length: 856, dtype: bool

# DF Filtering
- DataFrames can be filtered row-wise using a sequence of `True` and `False` values, one per row.
- These can be generated by queries, such as the Boolean expressions in the previous section.
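
As a sketch of how such masks are built and reused, the example below stores each condition in a variable and combines them with `&`, `|` and `~`. The four rows are illustrative values, not the full mortgage data set.

```python
import pandas as pd

df = pd.DataFrame({
    "Income": [17626, 20560, 24430, 81175],
    "Debt": [293, 898, 59, 2],
})

high_income = df["Income"] > 20000     # Boolean Series, one value per row
low_debt = df["Debt"] < 500

both = df[high_income & low_debt]      # rows where both conditions hold
either = df[high_income | low_debt]    # rows where at least one holds
not_high = df[~high_income]            # rows where the first does not hold

print(both)
```

Naming the masks keeps long filters readable and avoids the parentheses needed when the comparisons are written inline.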

In [193]:
df[(df["Income"] > 75000) & (df["Debt"] < 500)]

     Unnamed: 0   ID  Income      Term  Balance  Debt   Score  Default
787         787    9   76228  10 Years     4952    26  1009.0    False
791         791  738   81842  20 Years     3677    27  1001.0    False
795         795  235   77844  10 Years     4160    65  1000.0    False
798         798  753   81175  10 Years     5684     2  1006.0    False


# Aggregation - Group By
- Group table rows into sub-groups according to a specified criterion (for example, the values in a column).

GROUP BY and:
- Counting the number of rows in each group:
```python
df.groupby(criteria).size()
```
- Sum of every numerical column in each group:
```python
df.groupby(criteria).sum(numeric_only=True)
```
- Mean of every numerical column in each group:
```python
df.groupby(criteria).mean(numeric_only=True)
```

In [44]:
import pandas as pd
df=pd.read_csv("data/mortgage_applicants.csv")


# Group by 'Term' column
grouped = df.groupby('Term')

# Calculate mean of 'Income' for each group
mean_income = grouped['Income'].mean()

# Calculate median of 'Balance' for each group
median_balance = grouped['Balance'].median()

# Print the results
print("Mean Income by Mortgage Term:")
print(mean_income)
print("\nMedian Balance by Mortgage Term:")
print(median_balance)


Mean Income by Mortgage Term:
Term
10 Years    27800.559932
20 Years    34461.341912
Name: Income, dtype: float64

Median Balance by Mortgage Term:
Term
10 Years    1094.5
20 Years    1236.0
Name: Balance, dtype: float64


In [208]:
import pandas as pd

# Sample data
data = {
    'ID': [567, 523, 544, 370, 756, 929, 373, 818, 284, 621,
           404, 763, 327, 664, 590, 24, 931, 400, 905, 556,
           673, 537, 616, 422, 73, 466, 363, 291, 326, 483,
           563, 838, 740, 216, 302],
    'Income': [17626, 18959, 20560, 21894, 24430, 22995, 21124, 24644, 27138, 24521,
               19166, 20838, 18630, 24182, 17887, 23596, 18223, 18271, 17887, 19330,
               24185, 20232, 22869, 20543, 16915, 21437, 17905, 22585, 18015, 24055,
               16876, 21984, 22259, 16937, 21421],
    'Term': ['10 Years', '20 Years', '10 Years', '10 Years', '10 Years', '20 Years', '10 Years', '10 Years', '20 Years', '10 Years',
             '10 Years', '10 Years', '20 Years', '10 Years', '10 Years', '10 Years', '10 Years', '10 Years', '10 Years', '10 Years',
             '10 Years', '10 Years', '20 Years', '10 Years', '10 Years', '10 Years', '10 Years', '10 Years', '10 Years', '10 Years',
             '20 Years', '10 Years', '20 Years', '10 Years', '10 Years'],
    'Balance': [1381, 883, 684, 748, 1224, 1678, 1135, 1634, 840, 1271,
                854, 946, 1316, 1160, 901, 701, 889, 991, 1214, 973,
                771, 996, 1150, 731, 925, 1298, 928, 1185, 602, 740,
                991, 959, 985, 587, 1167],
    'Debt': [293, 1012, 898, 85, 59, 1329, 115, 105, 1877, 110,
             48, 1259, 37, 882, 1586, 34, 2162, 979, 960, 29,
             77, 804, 803, 1054, 1082, 51, 13, 532, 14, 20,
             34, 633, 1906, 1, 1347],
    'Score': [228.0, 187.0, 86.0, None, 504.0, 384.0, 560.0, 309.0, 251.0, 736.0,
               186.0, 136.0, 430.0, 247.0, 68.0, 333.0, 40.0, 197.0, 130.0, 400.0,
               None, 296.0, 431.0, 89.0, 147.0, 451.0, 168.0, 487.0, 346.0, 388.0,
               340.0, 244.0, 152.0, 259.0, 198.0],
    'Default': [False, False, False, False, False, False, False, False, False, True,
                 False, False, False, False, True, False, True, True, False, False,
                 False, False, True, False, False, False, False, False, False, False,
                 False, False, True, False, False]
}

# Create DataFrame
df = pd.DataFrame(data)

# Group by 'Term' column
grouped = df.groupby('Term')

# Calculate mean of 'Income' for each group
mean_income = grouped['Income'].mean()

# Calculate median of 'Balance' for each group
median_balance = grouped['Balance'].median()

# Print the results
print("Mean Income by Mortgage Term:")
print(mean_income)
print("\nMedian Balance by Mortgage Term:")
print(median_balance)
print("size:")
print(grouped.size())


Mean Income by Mortgage Term:
Term
10 Years    20728.321429
20 Years    21389.428571
Name: Income, dtype: float64

Median Balance by Mortgage Term:
Term
10 Years    952.5
20 Years    991.0
Name: Balance, dtype: float64
size:
Term
10 Years    28
20 Years     7
dtype: int64


To group the 'Income' data by ranges instead of specific values, you can use the `pd.cut()` function to discretize the 'Income' column into bins (ranges) and then group by these bins. Here's how you can do it:

```python
import pandas as pd

df = pd.read_csv("data/mortgage_applicants.csv")

# Define income bins
# (incomes of 35000 or more fall outside these bins and become NaN)
income_bins = [0, 20000, 25000, 30000, 35000]

# Define labels for income bins
income_labels = ['< 20k', '20k - 25k', '25k - 30k', '30k - 35k']

# Create a new column 'Income Range' with income bins
df['Income Range'] = pd.cut(df['Income'], bins=income_bins, labels=income_labels, right=False)

# Group by 'Income Range' column (observed=False keeps empty bins in the result)
grouped = df.groupby('Income Range', observed=False)

# Calculate mean of 'Income' for each group
mean_income = grouped['Income'].mean()

# Print the results
print("Mean Income by Income Range:")
print(mean_income)
```

In this code, the 'Income' column is discretized into bins using `pd.cut()`, specifying the bins and labels. Then, the DataFrame is grouped by the 'Income Range' column, and the mean income for each income range is calculated using the `groupby()` method. The results show the mean income for applicants grouped by income range. Adjust the `income_bins` and `income_labels` as needed to define the desired income ranges.

In [245]:
import pandas as pd

df=pd.read_csv("data/mortgage_applicants.csv")

# Define income bins
income_bins = [0, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]

# Define labels for income bins
income_labels = ['< 20k', '20k - 25k', '25k - 30k', '30K - 35K', '35K - 40K', '40K - 45K', '45K - 50K', '50K - 55K', '55K - 60K', '60K - 65K', '65K - 70K', '70K - 75K']

# Create a new column 'Income Range' with income bins
df['Income Range'] = pd.cut(df['Income'], bins=income_bins, labels=income_labels, right=False)

# Group by 'Income Range' column
grouped = df.groupby('Income Range', observed=False)

# Calculate mean of 'Income' for each group
mean_income = grouped['Income'].mean()

# Print the results
print("Mean Income by Income Range:")
print(mean_income)

Mean Income by Income Range:
Income Range
< 20k        18032.487069
20k - 25k    21914.092308
25k - 30k    27321.710526
30K - 35K    33377.480000
35K - 40K    37421.500000
40K - 45K    42492.876404
45K - 50K    46883.609756
50K - 55K    52598.923077
55K - 60K    57352.294118
60K - 65K    62290.210526
65K - 70K    67013.111111
70K - 75K    73730.000000
Name: Income, dtype: float64


# Pandas cut()
The `pd.cut()` function in pandas is used to segment and sort data values into bins (categories or intervals). It is particularly useful for discretizing continuous data into categorical data. The function returns a new categorical Series or array, where each element represents the bin (category) to which the corresponding element of the input belongs.

Here's the syntax of the `pd.cut()` function:

```python
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
```

- `x`: This argument represents the array or Series to be segmented into bins.
- `bins`: This argument specifies the intervals (bins) into which the data should be sorted. It can be an integer, specifying the number of equal-width bins to create, or a list/array of bin edges, defining the intervals explicitly.
- `right`: This boolean argument indicates whether the intervals should be closed on the right (`True`, default) or left (`False`) side.
- `labels`: This argument specifies the labels to assign to the resulting bins. If `None` (default), the intervals themselves are used as labels; if `False`, integer bin indicators are returned instead.
- `retbins`: This boolean argument indicates whether to return the computed bins as well as the resulting categorical Series. If `True`, it returns a tuple of the Series and the bins.
- `precision`: This argument specifies the number of decimal places to round the bins' edges.
- `include_lowest`: This boolean argument specifies whether the first interval should be left-inclusive (closed) or not.
- `duplicates`: This argument specifies how to handle duplicate bin edges if they exist.

Here's a general example of how to use `pd.cut()`:

```python
import pandas as pd

# Sample data
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Define bins
bins = [0, 30, 60, 100]

# Cut the data into bins
categories = pd.cut(data, bins)

# Display the result
print(categories)
```

Output:
```
[(0, 30], (0, 30], (0, 30], (30, 60], (30, 60], (30, 60], (60, 100], (60, 100], (60, 100], (60, 100]]
Categories (3, interval[int64]): [(0, 30] < (30, 60] < (60, 100]]
```

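In this example, the data `[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]` is segmented into three bins: `(0, 30]`, `(30, 60]`, and `(60, 100]`. Because `right=True` by default, each bin includes its right edge, so 30 falls in the first bin and 60 in the second. The resulting `categories` object (a `Categorical`, because the input was a list) shows the bin interval to which each element belongs.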
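
The effect of `right` and `labels` is easiest to see on values that sit exactly on a bin edge. The following is a small sketch with made-up values and label names:

```python
import pandas as pd

values = pd.Series([10, 30, 60, 100])
bins = [0, 30, 60, 100]
names = ["low", "mid", "high"]

# Default right=True: bins are (0, 30], (30, 60], (60, 100]
right_closed = pd.cut(values, bins, labels=names)

# right=False: bins are [0, 30), [30, 60), [60, 100), so 100 falls outside
left_closed = pd.cut(values, bins, labels=names, right=False)

# labels=False returns the integer position of the bin instead of a label
codes = pd.cut(values, bins, labels=False)

print(right_closed.tolist())   # ['low', 'low', 'mid', 'high']
print(left_closed.tolist())    # ['low' is now only 10; 100 becomes NaN]
print(codes.tolist())          # [0, 0, 1, 2]
```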

# Transform
`transform` is used to calculate quantities over a group but return as many rows as the input.
- Can be used to add, e.g., a grouped average column.
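
The difference from an ordinary aggregation is the shape of the result. A minimal sketch on four made-up rows (the `DebtVsTermMean` column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Term": ["10 Years", "20 Years", "10 Years", "20 Years"],
    "Debt": [100, 400, 300, 600],
})

# Aggregation: one value per group
per_group = df.groupby("Term")["Debt"].mean()

# Transform: the group value repeated for every row of that group
per_row = df.groupby("Term")["Debt"].transform("mean")

# Because per_row lines up with df, it can be used in row-wise arithmetic
df["DebtVsTermMean"] = df["Debt"] - per_row

print(per_group)
print(df)
```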

In [222]:
df['MeanTermDebt']=df.groupby("Term")['Debt'].transform('mean')
df.head()

    ID  Income      Term  Balance  Debt  Score  Default  Income Range  MeanTermDebt
0  567   17626  10 Years     1381   293  228.0    False         < 20k    544.000000
1  523   18959  20 Years      883  1012  187.0    False         < 20k    999.714286
2  544   20560  10 Years      684   898   86.0    False     20k - 25k    544.000000
3  370   21894  10 Years      748    85    NaN    False     20k - 25k    544.000000
4  756   24430  10 Years     1224    59  504.0    False     20k - 25k    544.000000


In [238]:
import pandas as pd
import numpy as np


df=pd.read_csv("data/mortgage_applicants.csv")

# Apply the transformation logic
#df['MeanTermDebt'] = df.groupby("Term")['Debt'].transform(np.mean)
df['MeanTermDebt'] = df.groupby("Term")['Debt'].transform("mean")

# Display the DataFrame with the new column
df.head()


   Unnamed: 0   ID  Income      Term  Balance  Debt  Score  Default  MeanTermDebt
0           0  567   17626  10 Years     1381   293  228.0    False    669.409247
1           1  523   18959  20 Years      883  1012  187.0    False    776.393382
2           2  544   20560  10 Years      684   898   86.0    False    669.409247
3           3  370   21894  10 Years      748    85    NaN    False    669.409247
4           4  756   24430  10 Years     1224    59  504.0    False    669.409247


# Transform in Detail
The line `df['MeanTermDebt'] = df.groupby("Term")['Debt'].transform("mean")` can be broken down as follows:

1. `df.groupby("Term")`: This part of the code groups the DataFrame `df` by the 'Term' column. It creates groups where each group corresponds to a unique value in the 'Term' column.

2. `['Debt']`: After grouping by 'Term', this part selects the 'Debt' column from each group.

3. `"mean"`: This part names the aggregation function to be applied to each group. In this case, it calculates the mean of the 'Debt' values within each group. Passing `np.mean` gives the same result, but recent pandas versions emit a warning and recommend the string alias.

4. `transform`: This method applies the specified aggregation function to each group and returns a Series with the same length as the original DataFrame, where each element holds the result for the group that row belongs to.

5. `df['MeanTermDebt']`: Finally, this part assigns the result of the transformation (i.e., the mean debt values for each group) to a new column named 'MeanTermDebt' in the original DataFrame `df`.

Applied to the mortgage applicants data:

```python
import pandas as pd

df = pd.read_csv("data/mortgage_applicants.csv")

# Apply the transformation logic
df['MeanTermDebt'] = df.groupby("Term")['Debt'].transform("mean")

# Display the first rows of the DataFrame with the new column
print(df.head())
```

In this example, we group the DataFrame `df` by the 'Term' column and calculate the mean debt (`'Debt'`) for each group. Then, we assign the resulting mean debt values to a new column named 'MeanTermDebt' in the original DataFrame. This gives every row the mean debt of its term category while preserving the original DataFrame's structure.