# Data Manipulation with Pandas


## Pivot Tables
  
Pivot tables in pandas are a powerful tool for data summarization. They allow you to reshape and analyze data in a meaningful way. The `pivot_table` function is quite versatile, enabling you to perform various operations such as aggregation, filtering, and grouping.

Pandas can be used to create Excel-style pivot tables:
- View statistics across category groups

Here’s a simple example to demonstrate the use of pivot tables in pandas, including the `index` parameter with its default behavior.

### Example Scenario

Let's assume we have a dataset of sales data for a retail store, with columns for the date of sale, the product category, the store region, and the sales amount.

Here’s a sample dataframe:



In [1]:
import pandas as pd

# Sample data
data = {
    'Date': ['2023-06-01', '2023-06-01', '2023-06-02', '2023-06-02', '2023-06-03', '2023-06-03'],
    'Product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples', 'Bananas'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 150, 200, 250, 300, 350]
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

#This will give us the following dataframe:


         Date  Product Region  Sales
0  2023-06-01   Apples  North    100
1  2023-06-01  Bananas  North    150
2  2023-06-02   Apples  South    200
3  2023-06-02  Bananas  South    250
4  2023-06-03   Apples  North    300
5  2023-06-03  Bananas  South    350


### Creating a Pivot Table

We want to create a pivot table to summarize the sales data by `Date` and `Product`. 

Here’s how we can do it:



In [27]:

# Create pivot table
pivot = pd.pivot_table(df, values='Sales', index='Date', columns='Product', aggfunc='sum')

print(pivot)



Product     Apples  Bananas
Date                       
2023-06-01     100      150
2023-06-02     200      250
2023-06-03     300      350



### Explanation

- `values='Sales'`: The values we want to aggregate (i.e., the sales amounts).
- `index='Date'`: The index of the resulting pivot table, summarizing data by date.
- `columns='Product'`: The columns of the pivot table, summarizing data by product.
- `aggfunc='sum'`: The aggregation function to use (summing sales in this case).
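
As a variation on the parameters explained above, the following sketch pivots the same sample data by `Region` instead of `Date`. The variable name `by_region` is illustrative and not part of the original example.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'Date': ['2023-06-01', '2023-06-01', '2023-06-02', '2023-06-02', '2023-06-03', '2023-06-03'],
    'Product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples', 'Bananas'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 150, 200, 250, 300, 350]
})

# Rows are regions, columns are products, cells are summed sales
by_region = pd.pivot_table(df, values='Sales', index='Region', columns='Product', aggfunc='sum')

print(by_region)
```

Only the `index` argument changed, so the same sales figures are now grouped by region rather than by date.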



### Default Index Behavior

In the example above, we explicitly set `index='Date'`. If you omit the `index` parameter, pandas does not group the rows by any column: it treats the entire dataset as a single group and aggregates `values` for each entry in `columns`.

For instance, if we create a pivot table without specifying the `index` parameter, the result shows the total sales for `Apples` and `Bananas` without any further breakdown by `Date` or `Region`.



In [48]:
import pandas as pd

# Sample data
data = {
    'Date': ['2023-06-01', '2023-06-01', '2023-06-02', '2023-06-02', '2023-06-03', '2023-06-03'],
    'Product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples', 'Bananas'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 150, 200, 250, 300, 350]
}

# Create DataFrame
df = pd.DataFrame(data)
pivot_default_index = pd.pivot_table(df, values='Sales', columns='Product', aggfunc='sum')

print(pivot_default_index)


Product  Apples  Bananas
Sales       600      750


### Explanation of the Output

When `index` is not specified, pandas defaults to summarizing the data using the entire dataset for the specified `values` and `columns`. Here's what happens step by step:

1. **Aggregation by Columns Only**: Since `index` is not specified, pandas considers the entire dataset as a single group. It then performs the aggregation (`aggfunc='sum'`) for the specified `columns` (`Product`).

2. **Summing Sales**: The `Sales` values are summed for each product category:
   - Total sales for `Apples` = 100 + 200 + 300 = 600
   - Total sales for `Bananas` = 150 + 250 + 350 = 750

### Resulting Pivot Table

Thus, the resulting pivot table is a single-row summary:

```
Product  Apples  Bananas
Sales       600      750
```

### Key Points

- **Index Parameter**: Not specifying the `index` parameter means that pandas will not break down the data into groups based on any specific column(s). Instead, it will aggregate the data for the entire dataset.
- **Aggregation Function**: The `aggfunc='sum'` function sums the sales values for each product category.

### Summary

The pivot table without an `index` parameter aggregates the sales data across the entire dataset for each product category, resulting in a total sales summary for each product. This is why the output shows the total sales for `Apples` and `Bananas` without any further breakdown by `Date` or `Region`.
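
One way to check these totals is to compute them with `groupby`, which for this data should give the same numbers as the pivot table without an `index`. This is a minimal sketch; the variable name `totals` is illustrative.

```python
import pandas as pd

# Only the two columns needed for the totals
df = pd.DataFrame({
    'Product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples', 'Bananas'],
    'Sales': [100, 150, 200, 250, 300, 350]
})

# Group all rows by product and sum the sales in each group
totals = df.groupby('Product')['Sales'].sum()

print(totals)
```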

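## Specifying Multiple Indexes
To use several columns as the index, pass a list of column names to `index`. In the example below, the list is built from every column that is used neither for `columns` nor for `values`.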

In [50]:
import pandas as pd

# Sample data
data = {
    'Date': ['2023-06-01', '2023-06-01', '2023-06-02', '2023-06-02', '2023-06-03', '2023-06-03'],
    'Product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples', 'Bananas'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 150, 200, 250, 300, 350]
}
df = pd.DataFrame(data)

values_column = 'Sales'
columns_column = 'Product'

# Determine the index columns dynamically
index_columns = [col for col in df.columns if col not in [values_column, columns_column]]

# Create pivot table using all columns that are not product or sales
pivot_default_index = pd.pivot_table(df, index=index_columns, values='Sales', columns='Product', aggfunc='sum')

print(pivot_default_index)


Product            Apples  Bananas
Date       Region                 
2023-06-01 North    100.0    150.0
2023-06-02 South    200.0    250.0
2023-06-03 North    300.0      NaN
           South      NaN    350.0



Notice that the resulting pivot table has a multi-level index consisting of `Date` and `Region`.
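
The `NaN` entries in the table above mark combinations of date and region with no sales for that product. As a sketch of how to handle them, `pivot_table` accepts a `fill_value` argument, and a single date can be selected from the multi-level index with `.loc`. The names `pivot_filled` and `one_day` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-06-01', '2023-06-01', '2023-06-02', '2023-06-02', '2023-06-03', '2023-06-03'],
    'Product': ['Apples', 'Bananas', 'Apples', 'Bananas', 'Apples', 'Bananas'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 150, 200, 250, 300, 350]
})

# Replace missing combinations with 0 instead of NaN
pivot_filled = pd.pivot_table(
    df, index=['Date', 'Region'], values='Sales',
    columns='Product', aggfunc='sum', fill_value=0
)
print(pivot_filled)

# Select all regions for one date from the first index level
one_day = pivot_filled.loc['2023-06-03']
print(one_day)
```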

### Summary

Pivot tables in pandas are a powerful tool for data aggregation and summarization. By specifying the `index` parameter, you can control which columns are used to index the resulting pivot table. When `index` is not specified, pandas aggregates over the entire dataset for each entry in `columns`, producing a single-row summary. To group by every column that is not part of `values` or `columns`, pass those columns to `index` as a list, which yields a multi-level index when more than one column is given.

# Indexing By Time
To load the JSON data from a file using pandas with the `orient="split"` orientation, you can follow these steps:

1. **Ensure the JSON File is Correctly Formatted**: Make sure your JSON file is correctly formatted and saved on your filesystem.
2. **Read the JSON File**: Use the `pd.read_json` function to read the JSON file, specifying the `orient` parameter as "split".

Here is a step-by-step example of how to do this:

### JSON File Content

First, save your JSON data into a file named `weather_data.json`. The content of the file should be:

```json
{
    "index": ["2023-07-15", "2023-07-16", "2023-07-17", "2023-07-18", "2023-07-19", "2023-07-20", "2023-07-21", "2023-07-22", "2023-07-23", "2023-07-24"],
    "columns": ["temp", "humidity", "sun_hrs"],
    "data": [
        [15.68, 73.18, 6.4],
        [25.16, 83.88, 8.06],
        [13.26, 80.05, 4.89],
        [24.63, 82.37, 9.13],
        [12.78, 83.1, 17.1],
        [23.52, 85.35, 0.72],
        [17.8, 85.64, 5.79],
        [24.98, 76.81, 10.95],
        [23.48, 80.86, 3.77],
        [23.3, 79.96, 14.62]
    ]
}
```

Save this content into a file named `weather_data.json` in the `data` directory.

### Python Code to Load the JSON File

Now, use the following Python code to read the JSON file and load it into a pandas DataFrame:

```python
import pandas as pd

# Path to the JSON file
file_path = "data/weather_data.json"

# Load the JSON file into a DataFrame with the specified orient
df = pd.read_json(file_path, orient="split")

# Display the DataFrame
print(df)
```

### Explanation

1. **Import pandas**: Import the pandas library which is necessary for DataFrame operations.
2. **File Path**: Define the path to your JSON file. In this case, the file is assumed to be located in the `data` directory.
3. **`pd.read_json` Function**:
   - Use `pd.read_json(file_path, orient="split")` to read the JSON file.
   - The `orient="split"` parameter is specified to indicate the structure of the JSON file.
4. **Display the DataFrame**: Use `print(df)` to display the loaded DataFrame.

### Expected Output

The DataFrame should be loaded correctly with the index, columns, and data as specified in the JSON file:

```
             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
2023-07-20  23.52     85.35     0.72
2023-07-21  17.80     85.64     5.79
2023-07-22  24.98     76.81    10.95
2023-07-23  23.48     80.86     3.77
2023-07-24  23.30     79.96    14.62
```

With these steps, you can effectively load and manipulate JSON data stored in files using pandas.

The `orient='split'` parameter in pandas' `pd.read_json` function is designed to interpret the JSON structure where the data is split into separate lists for the index, columns, and data. This orientation helps pandas understand how to construct the DataFrame from the JSON content.

### Structure of `orient='split'`

The `split` orientation JSON structure consists of three key parts:
1. **`index`**: A list of values that will be used for the DataFrame's index.
2. **`columns`**: A list of column names.
3. **`data`**: A list of lists where each sublist represents a row of data.

Here’s a breakdown of the JSON structure provided:

```json
{
    "index": ["2023-07-15", "2023-07-16", "2023-07-17", "2023-07-18", "2023-07-19", "2023-07-20", "2023-07-21", "2023-07-22", "2023-07-23", "2023-07-24"],
    "columns": ["temp", "humidity", "sun_hrs"],
    "data": [
        [15.68, 73.18, 6.4],
        [25.16, 83.88, 8.06],
        [13.26, 80.05, 4.89],
        [24.63, 82.37, 9.13],
        [12.78, 83.1, 17.1],
        [23.52, 85.35, 0.72],
        [17.8, 85.64, 5.79],
        [24.98, 76.81, 10.95],
        [23.48, 80.86, 3.77],
        [23.3, 79.96, 14.62]
    ]
}
```

### How pandas Interprets `orient='split'`

When you use `pd.read_json` with `orient='split'`, pandas expects the JSON object to have the specific keys: `index`, `columns`, and `data`. Here’s how pandas uses these keys:

1. **`index` Key**:
   - The `index` key contains a list of values that will become the DataFrame’s index. In this case, the dates ["2023-07-15", "2023-07-16", ..., "2023-07-24"] will be used as the index.
   
2. **`columns` Key**:
   - The `columns` key contains a list of column names for the DataFrame. Here, the column names are ["temp", "humidity", "sun_hrs"].
   
3. **`data` Key**:
   - The `data` key contains a list of lists, where each inner list represents a row of data corresponding to the column names and the index values.
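
To make the role of the three keys concrete, the sketch below parses a shortened version of the JSON with the standard `json` module and passes each key to the matching `pd.DataFrame` argument by hand. It is only an illustration of the mapping, not how `pd.read_json` is implemented. Unlike `pd.read_json`, this manual version leaves the index as plain strings. The names `split_json`, `payload` and `manual_df` are illustrative.

```python
import json

import pandas as pd

# A shortened copy of the file content (first two rows only)
split_json = """
{
    "index": ["2023-07-15", "2023-07-16"],
    "columns": ["temp", "humidity", "sun_hrs"],
    "data": [[15.68, 73.18, 6.4], [25.16, 83.88, 8.06]]
}
"""

payload = json.loads(split_json)

# Each key of the "split" layout maps onto one DataFrame argument
manual_df = pd.DataFrame(
    payload["data"],
    index=payload["index"],
    columns=payload["columns"],
)

print(manual_df)
```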

### Example Code to Load the JSON

Here’s the Python code to read the JSON file with the `split` orientation:

```python
import pandas as pd

# Path to the JSON file
file_path = "data/weather_data.json"

# Load the JSON file into a DataFrame with the specified orient
df = pd.read_json(file_path, orient="split")

# Display the DataFrame
print(df)
```

### Resulting DataFrame

Given the JSON structure, the resulting DataFrame will look like this:

```
             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
2023-07-20  23.52     85.35     0.72
2023-07-21  17.80     85.64     5.79
2023-07-22  24.98     76.81    10.95
2023-07-23  23.48     80.86     3.77
2023-07-24  23.30     79.96    14.62
```

### Key Points

- **Index Assignment**: The values in the `index` key become the DataFrame’s index. This is why the dates are used as the index.
- **Column Names**: The values in the `columns` key become the DataFrame’s column names.
- **Data Values**: The values in the `data` key are the actual data points, aligned with the columns and index.

### Role of `orient='split'`

The `orient='split'` parameter plays a crucial role because it tells pandas to look for the `index`, `columns`, and `data` keys in the JSON object and use them to construct the DataFrame accordingly. Without it, `pd.read_json` falls back to its default orientation for DataFrames (`'columns'`) and would not interpret this structure as intended.
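
The same orientation is available when writing JSON. The sketch below writes a small frame with `to_json(orient="split")` and reads it back from an in-memory buffer. The row labels `row_a` and `row_b` and the variable names are hypothetical, chosen only for this illustration.

```python
import io
import json

import pandas as pd

# A small frame with hypothetical row labels
frame = pd.DataFrame(
    {"temp": [15.68, 25.16], "humidity": [73.18, 83.88]},
    index=["row_a", "row_b"],
)

# Serialize using the split layout: separate index, columns and data lists
split_text = frame.to_json(orient="split")
print(sorted(json.loads(split_text).keys()))

# Read it back, telling pandas which layout to expect
restored = pd.read_json(io.StringIO(split_text), orient="split")
print(restored)
```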

## Indexing By Time

In [15]:
import pandas as pd
file_path = r"data/weather.json"
df = pd.read_json(file_path, orient="split")
print(df)

             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
2023-07-20  23.52     85.35     0.72
2023-07-21  17.80     85.64     5.79
2023-07-22  24.98     76.81    10.95
2023-07-23  23.48     80.86     3.77
2023-07-24  23.30     79.96    14.62


In [17]:
df.index

DatetimeIndex(['2023-07-15', '2023-07-16', '2023-07-17', '2023-07-18',
               '2023-07-19', '2023-07-20', '2023-07-21', '2023-07-22',
               '2023-07-23', '2023-07-24'],
              dtype='datetime64[ns]', freq=None)

## Slicing by Time

In [19]:
df.loc['2023-07-15']

temp        15.68
humidity    73.18
sun_hrs      6.40
Name: 2023-07-15 00:00:00, dtype: float64

In [21]:
df.loc['2023-07-15':'2023-07-20',:]

             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
2023-07-20  23.52     85.35     0.72


In [27]:
df.loc['2023-07', :]

             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
2023-07-20  23.52     85.35     0.72
2023-07-21  17.80     85.64     5.79
2023-07-22  24.98     76.81    10.95
2023-07-23  23.48     80.86     3.77
2023-07-24  23.30     79.96    14.62
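
The two slicing styles used above can be sketched on a small self-contained frame. The data here is made up so that the index spans two months; the names `ts`, `july` and `window` are illustrative.

```python
import pandas as pd

# Six consecutive days spanning the end of July and the start of August
dates = pd.date_range('2023-07-28', periods=6, freq='D')
ts = pd.DataFrame({'temp': [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]}, index=dates)

# Partial string: every row that falls in July 2023
july = ts.loc['2023-07']

# Date range: both endpoints are included in the result
window = ts.loc['2023-07-30':'2023-08-01']

print(july)
print(window)
```

Unlike positional slicing, label-based slicing on a `DatetimeIndex` includes the end date.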


## Offsets and Frequencies
- Date ranges can be defined very flexibly using pandas offsets
- We can add intervals to timestamps

In [31]:
pd.date_range(start='2020', end='2024', freq='Q')

DatetimeIndex(['2020-03-31', '2020-06-30', '2020-09-30', '2020-12-31',
               '2021-03-31', '2021-06-30', '2021-09-30', '2021-12-31',
               '2022-03-31', '2022-06-30', '2022-09-30', '2022-12-31',
               '2023-03-31', '2023-06-30', '2023-09-30', '2023-12-31'],
              dtype='datetime64[ns]', freq='Q-DEC')

In [33]:
pd.date_range(start='2020', end='2024', freq='Q') + pd.tseries.offsets.Day(1)

DatetimeIndex(['2020-04-01', '2020-07-01', '2020-10-01', '2021-01-01',
               '2021-04-01', '2021-07-01', '2021-10-01', '2022-01-01',
               '2022-04-01', '2022-07-01', '2022-10-01', '2023-01-01',
               '2023-04-01', '2023-07-01', '2023-10-01', '2024-01-01'],
              dtype='datetime64[ns]', freq=None)
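
As a further sketch of offsets, the example below adds offsets to a single timestamp and to a whole date range. The specific dates are arbitrary and chosen only for illustration.

```python
import pandas as pd

day = pd.Timestamp('2023-07-15')

# Move forward by one calendar day
next_day = day + pd.offsets.Day(1)

# Roll forward to the end of the current month
month_end = day + pd.offsets.MonthEnd(1)

# Three month-start dates, then shift each one by 14 days
month_starts = pd.date_range(start='2023-01-01', periods=3, freq='MS')
shifted = month_starts + pd.DateOffset(days=14)

print(next_day, month_end)
print(shifted)
```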

## Dealing with Timezones

In [35]:
pd.date_range(start='2020', end='2024', tz='UTC')

DatetimeIndex(['2020-01-01 00:00:00+00:00', '2020-01-02 00:00:00+00:00',
               '2020-01-03 00:00:00+00:00', '2020-01-04 00:00:00+00:00',
               '2020-01-05 00:00:00+00:00', '2020-01-06 00:00:00+00:00',
               '2020-01-07 00:00:00+00:00', '2020-01-08 00:00:00+00:00',
               '2020-01-09 00:00:00+00:00', '2020-01-10 00:00:00+00:00',
               ...
               '2023-12-23 00:00:00+00:00', '2023-12-24 00:00:00+00:00',
               '2023-12-25 00:00:00+00:00', '2023-12-26 00:00:00+00:00',
               '2023-12-27 00:00:00+00:00', '2023-12-28 00:00:00+00:00',
               '2023-12-29 00:00:00+00:00', '2023-12-30 00:00:00+00:00',
               '2023-12-31 00:00:00+00:00', '2024-01-01 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', length=1462, freq='D')

In [37]:
pd.date_range(start='2020', end='2024', tz='UTC').tz_convert('Europe/Madrid')

DatetimeIndex(['2020-01-01 01:00:00+01:00', '2020-01-02 01:00:00+01:00',
               '2020-01-03 01:00:00+01:00', '2020-01-04 01:00:00+01:00',
               '2020-01-05 01:00:00+01:00', '2020-01-06 01:00:00+01:00',
               '2020-01-07 01:00:00+01:00', '2020-01-08 01:00:00+01:00',
               '2020-01-09 01:00:00+01:00', '2020-01-10 01:00:00+01:00',
               ...
               '2023-12-23 01:00:00+01:00', '2023-12-24 01:00:00+01:00',
               '2023-12-25 01:00:00+01:00', '2023-12-26 01:00:00+01:00',
               '2023-12-27 01:00:00+01:00', '2023-12-28 01:00:00+01:00',
               '2023-12-29 01:00:00+01:00', '2023-12-30 01:00:00+01:00',
               '2023-12-31 01:00:00+01:00', '2024-01-01 01:00:00+01:00'],
              dtype='datetime64[ns, Europe/Madrid]', length=1462, freq='D')
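
The difference between attaching a timezone and converting between timezones can be sketched with a single timestamp. `tz_localize` labels a naive timestamp with a zone, while `tz_convert` re-expresses the same instant in another zone. The timestamp value is arbitrary.

```python
import pandas as pd

# A timestamp with no timezone information
naive = pd.Timestamp('2023-07-15 12:00')

# Declare that the naive time is in UTC
utc = naive.tz_localize('UTC')

# Express the same instant in Madrid local time (UTC+2 in July)
madrid = utc.tz_convert('Europe/Madrid')

print(utc)
print(madrid)
```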

### Time Periods

Datetimes can be converted to periods
- e.g. months

The index doesn't need to be unique
- Multiple rows are returned for each period


In [69]:
df.to_period('M').sample(5)

          temp  humidity  sun_hrs
2023-07  13.26     80.05     4.89
2023-07  24.63     82.37     9.13
2023-07  23.30     79.96    14.62
2023-07  15.68     73.18     6.40
2023-07  25.16     83.88     8.06


## Moving Window Functions

The `min_periods` parameter in the `df.rolling(window=7, min_periods=2).mean()` function call is used to specify the minimum number of observations in the window required to compute a result. If the number of non-NA observations in the window is less than `min_periods`, the result will be NA.

### Explanation with the Given Data Set

Let's consider the provided dataset and the usage of `rolling` with `window=7` and `min_periods=2`.

#### Sample DataFrame

```python
import pandas as pd

# Sample data
data = {
    "index": ["2023-07-15", "2023-07-16", "2023-07-17", "2023-07-18", "2023-07-19", "2023-07-20", "2023-07-21", "2023-07-22", "2023-07-23", "2023-07-24"],
    "temp": [15.68, 25.16, 13.26, 24.63, 12.78, 23.52, 17.8, 24.98, 23.48, 23.3],
    "humidity": [73.18, 83.88, 80.05, 82.37, 83.1, 85.35, 85.64, 76.81, 80.86, 79.96],
    "sun_hrs": [6.4, 8.06, 4.89, 9.13, 17.1, 0.72, 5.79, 10.95, 3.77, 14.62]
}

# Create DataFrame
df = pd.DataFrame(data)
df.set_index("index", inplace=True)
```

#### Using Rolling with `window` and `min_periods`

```python
# Applying rolling with a window of 7 and min_periods of 2, then taking the mean and rounding the result
result = df.rolling(window=7, min_periods=2).mean().round(2)

# Display the result
print(result.head())
```

### What Happens with `min_periods`

1. **`window=7`**: Specifies that the rolling window size is 7. This means the function will compute the mean over a moving window of 7 observations.
2. **`min_periods=2`**: Specifies that at least 2 observations are required in the window to compute a result. If there are fewer than 2 observations in the window, the result for that window will be NA.

### Understanding the Output

The `rolling` operation starts at the first row and moves down the DataFrame, computing the mean for each window. Let's walk through the first few rows to see how the rolling mean is calculated:

1. **First Row (index 2023-07-15)**:
   - Only one observation (15.68 for `temp`, 73.18 for `humidity`, 6.4 for `sun_hrs`).
   - Since `min_periods=2`, it returns NA because there is only one observation.

2. **Second Row (index 2023-07-16)**:
   - Two observations: [15.68, 25.16] for `temp`, [73.18, 83.88] for `humidity`, [6.4, 8.06] for `sun_hrs`.
   - It returns the mean of these two observations as it meets the minimum required periods.

3. **Third Row (index 2023-07-17)**:
   - Three observations: [15.68, 25.16, 13.26] for `temp`, [73.18, 83.88, 80.05] for `humidity`, [6.4, 8.06, 4.89] for `sun_hrs`.
   - It returns the mean of these three observations.

4. **And so on...** until the window size is fully utilized.

### Example Calculation

Here’s the detailed calculation for the rolling mean with `window=7` and `min_periods=2`:

- **First Row**: NA (less than 2 observations)
- **Second Row**: Mean of the first two rows
  - `temp`: (15.68 + 25.16) / 2 = 20.42
  - `humidity`: (73.18 + 83.88) / 2 = 78.53
  - `sun_hrs`: (6.4 + 8.06) / 2 = 7.23

This process continues, with the rolling window accumulating up to 7 observations, and computing the mean as long as there are at least 2 observations.

### Result

```plaintext
              temp  humidity  sun_hrs
index                                
2023-07-15     NaN       NaN      NaN
2023-07-16   20.42     78.53     7.23
2023-07-17   18.03     79.04     6.45
2023-07-18   19.68     79.87     7.12
2023-07-19   18.30     80.52     9.12
```

- For each row in the output, the rolling mean is computed based on the values in the current window, provided there are at least 2 non-NA values.

### Summary

The `min_periods` parameter in the rolling function ensures that a minimum number of observations are present in each window for the computation to be valid. If the number of observations is less than `min_periods`, the result is NA. This allows for flexibility in dealing with smaller initial windows, especially at the beginning of the dataset.
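
The effect of `min_periods` is easiest to see on a tiny series. The sketch below compares the default behavior (for an integer window, `min_periods` defaults to the window size) with `min_periods=2`. The values are arbitrary.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Default: a result needs a full window of 3 observations
strict = s.rolling(window=3).mean()

# Relaxed: a result is produced once 2 observations are available
relaxed = s.rolling(window=3, min_periods=2).mean()

print(strict)
print(relaxed)
```

With the default, the first two results are `NaN`; with `min_periods=2`, only the first one is.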

In [71]:
df.rolling(window=7,min_periods=2).mean().round(2).head()


             temp  humidity  sun_hrs
2023-07-15    NaN       NaN      NaN
2023-07-16  20.42     78.53     7.23
2023-07-17  18.03     79.04     6.45
2023-07-18  19.68     79.87     7.12
2023-07-19  18.30     80.52     9.12


In [84]:
df.rolling(window=7,min_periods=3).mean().round(2)


             temp  humidity  sun_hrs
2023-07-15    NaN       NaN      NaN
2023-07-16    NaN       NaN      NaN
2023-07-17  18.03     79.04     6.45
2023-07-18  19.68     79.87     7.12
2023-07-19  18.30     80.52     9.12
2023-07-20  19.17     81.32     7.72
2023-07-21  18.98     81.94     7.44
2023-07-22  20.30     82.46     8.09
2023-07-23  20.06     82.03     7.48
2023-07-24  21.50     82.01     8.87


# Streaming Large Files

When a file is too large, don't process it all at once.

Read functions allow chunks to be read in:
- The file is processed N rows at a time
- Results need to be accumulated iteratively



In [88]:
for i, chunk in enumerate(pd.read_csv('data/loan_data.csv', chunksize=100)):
    if i == 0:
        null_count = chunk.isna().sum()
    else:
        null_count += chunk.isna().sum()

null_count


ID          0
Income      0
Term        0
Balance     0
Debt        0
Score      20
Default     0
dtype: int64
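
The same pattern can be tried without a file on disk. The sketch below reads a small in-memory CSV in chunks of two rows and accumulates the null counts and the row count. The CSV content and the names `csv_text`, `null_count` and `total_rows` are made up for illustration.

```python
import io

import pandas as pd

# A tiny CSV with two missing Score values
csv_text = "ID,Score\n1,225\n2,\n3,85\n4,\n5,495\n"

null_count = None
total_rows = 0

# Each iteration yields a DataFrame with at most 2 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_nulls = chunk.isna().sum()
    null_count = chunk_nulls if null_count is None else null_count + chunk_nulls
    total_rows += len(chunk)

print(null_count)
print(total_rows)
```

Starting from `None` avoids the need for `enumerate` and a special case on the first index.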

## Combining Tables

### Merging Dataframes

The `merge` function offers SQL-like joins between DataFrames. It combines tables by matching values in one or more key columns, and you can specify the type of join to apply.


In [169]:
df=pd.read_csv("data/loan_data.csv")
df

      ID  Income        Term  Balance  Debt  Score  Default
0    567   17500  Short Term     1460   272  225.0    False
1    523   18500   Long Term      890   970  187.0    False
2    544   20700  Short Term      880   884   85.0    False
3    370   21600  Short Term      920     0    NaN    False
4    756   24300  Short Term     1260     0  495.0    False
..   ...     ...         ...      ...   ...    ...      ...
851   71   30000   Long Term     1270  3779   52.0     True
852  932   42500   Long Term     1550     0  779.0    False
853   39   36400   Long Term     1830  3032  360.0     True
854  283   42200   Long Term     1500  2498  417.0    False


In [96]:
locations=pd.read_csv("data/locations.csv")
locations.head()

   Unnamed: 0   ID    nation
0           0  567   England
1           1  523  Scotland
2           2  544  Scotland
3           3  370   England
4           4  756  Scotland


In [98]:
pd.merge(left=df, right=locations, on='ID').head()

    ID  Income        Term  Balance  Debt  Score  Default  Unnamed: 0    nation
0  567   17500  Short Term     1460   272  225.0    False           0   England
1  523   18500   Long Term      890   970  187.0    False           1  Scotland
2  544   20700  Short Term      880   884   85.0    False           2  Scotland
3  370   21600  Short Term      920     0    NaN    False           3   England
4  756   24300  Short Term     1260     0  495.0    False           4  Scotland
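
The type of join is chosen with the `how` argument of `pd.merge`. The sketch below contrasts an inner join with a left join on two small frames. The frames are made up for illustration (ID `999` and `Wales` are hypothetical), and the names `loans` and `nations` are not from the original data.

```python
import pandas as pd

loans = pd.DataFrame({'ID': [567, 523, 544], 'Income': [17500, 18500, 20700]})
nations = pd.DataFrame({'ID': [567, 523, 999], 'nation': ['England', 'Scotland', 'Wales']})

# Inner join (the default): keep only IDs present in both tables
inner = pd.merge(left=loans, right=nations, on='ID', how='inner')

# Left join: keep every loan, with NaN where no location matches
left = pd.merge(left=loans, right=nations, on='ID', how='left')

print(inner)
print(left)
```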


## Merging Multiple DataFrames

Chained Merges join further tables



In [165]:
business_accounts=pd.read_csv("data/accounts.csv")
print(business_accounts)
pd.merge(left=df, right=locations, on='ID').merge(right=business_accounts, on='ID').head()

     Unnamed: 0   ID  has_business_account
0             0  567                 False
1             1  523                 False
2             2  544                 False
3             3  370                 False
4             4  756                  True
..          ...  ...                   ...
851         851   71                 False
852         852  932                 False
853         853   39                  True
854         854  283                 False
855         855  847                 False

[856 rows x 3 columns]


    ID  Income        Term  Balance  Debt  Score  Default  Unnamed: 0_x    nation  Unnamed: 0_y  has_business_account
0  567   17500  Short Term     1460   272  225.0    False             0   England             0                 False
1  523   18500   Long Term      890   970  187.0    False             1  Scotland             1                 False
2  544   20700  Short Term      880   884   85.0    False             2  Scotland             2                 False
3  370   21600  Short Term      920     0    NaN    False             3   England             3                 False
4  756   24300  Short Term     1260     0  495.0    False             4  Scotland             4                  True
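
When more than a couple of tables share the same key, the chain of merges can also be written as a fold over a list. This is one possible sketch using `functools.reduce`. The frames are small made-up examples, and the names `loans`, `nations`, `accounts` and `combined` are illustrative.

```python
from functools import reduce

import pandas as pd

loans = pd.DataFrame({'ID': [567, 523], 'Income': [17500, 18500]})
nations = pd.DataFrame({'ID': [567, 523], 'nation': ['England', 'Scotland']})
accounts = pd.DataFrame({'ID': [567, 523], 'has_business_account': [False, True]})

# Merge the frames pairwise, left to right, on the shared ID column
frames = [loans, nations, accounts]
combined = reduce(lambda left, right: pd.merge(left, right, on='ID'), frames)

print(combined)
```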


## Concatenating DataFrames

Concat sticks DataFrames together without a condition


In [106]:
import pandas as pd
file_path = r"data/weather.json"
dfhead = pd.read_json(file_path, orient="split").head()
dftail=pd.read_json(file_path, orient="split").tail()
print(dfhead)
print(dftail)

             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
             temp  humidity  sun_hrs
2023-07-20  23.52     85.35     0.72
2023-07-21  17.80     85.64     5.79
2023-07-22  24.98     76.81    10.95
2023-07-23  23.48     80.86     3.77
2023-07-24  23.30     79.96    14.62


In [167]:
pd.concat([dfhead,dftail]).loc['2023-07-16':'2023-07-23',:]


             temp  humidity  sun_hrs
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
2023-07-20  23.52     85.35     0.72
2023-07-21  17.80     85.64     5.79
2023-07-22  24.98     76.81    10.95
2023-07-23  23.48     80.86     3.77
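
By default `pd.concat` keeps the original row labels of each piece. The sketch below shows this alongside `ignore_index=True`, which discards the labels and renumbers the rows. The two small frames and the names `first`, `second`, `stacked` and `renumbered` are illustrative.

```python
import pandas as pd

first = pd.DataFrame({'temp': [15.68, 25.16]}, index=['2023-07-15', '2023-07-16'])
second = pd.DataFrame({'temp': [13.26]}, index=['2023-07-17'])

# Rows are appended and the original labels are preserved
stacked = pd.concat([first, second])

# Rows are appended and relabelled 0, 1, 2
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked)
print(renumbered)
```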


In [112]:
print(dfhead)

             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10


In [154]:
incompletedf=dfhead.loc[['2023-07-15','2023-07-17','2023-07-19'], :]
incompletedf

             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-17  13.26     80.05     4.89
2023-07-19  12.78     83.10    17.10


In [160]:
missingValsDf=dfhead.loc[['2023-07-16','2023-07-18'], :]
missingValsDf

             temp  humidity  sun_hrs
2023-07-16  25.16     83.88     8.06
2023-07-18  24.63     82.37     9.13


In [163]:
incompletedf.combine_first(missingValsDf)

             temp  humidity  sun_hrs
2023-07-15  15.68     73.18     6.40
2023-07-16  25.16     83.88     8.06
2023-07-17  13.26     80.05     4.89
2023-07-18  24.63     82.37     9.13
2023-07-19  12.78     83.10    17.10
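
In the example above the two frames have no overlapping rows, so `combine_first` simply interleaves them. The sketch below shows what happens when they do overlap: values from the calling frame take priority, and the other frame only fills gaps. The value `99.0` is a hypothetical placeholder, and the names `primary`, `backup` and `patched` are illustrative.

```python
import numpy as np
import pandas as pd

# The primary frame has one missing reading
primary = pd.DataFrame({'temp': [15.68, np.nan]}, index=['2023-07-15', '2023-07-16'])

# The backup frame overlaps on both dates and adds a third
backup = pd.DataFrame(
    {'temp': [99.0, 25.16, 13.26]},
    index=['2023-07-15', '2023-07-16', '2023-07-17'],
)

# Keep primary values where present, otherwise fall back to backup
patched = primary.combine_first(backup)

print(patched)
```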
