In [None]:
import pandas as pd
import numpy as np
import os

# Lecture 6 – Concatenating and Merging

## DSC 80, Spring 2023

### Agenda

- Aside: Time series data.
- Concatenating DataFrames vertically and horizontally.
- Merging.
    - Types of joins.
    - Many-to-one and many-to-many joins.

## Aside: Working with time series data

### Time series – why now?

- We're about to start looking at how to combine multiple DataFrames.
- Data is often partitioned by time. For instance, there may be one `.csv` file per day for 1 year.
- We will need to load in the files as DataFrames and `pd.concat` them together.
- Note: "time series" is a general term and is not related to Series in `pandas`.

### Datetime types

When working with time data, you will see two different kinds of "times":

* **Datetimes** reference particular moments in time (e.g. November 26th, 1998 at 8:26AM).
    - Could just be a date, e.g. January 23, 2023.
    - Could just be a time, e.g. 4:45 AM.
    - Datetimes typically don't keep track of timezones.

* **Timedeltas**, or durations, reference an exact length of time (e.g. a duration of 3 hours).

### The `datetime` module

Python has a built-in `datetime` module, which contains `datetime` and `timedelta` types. These are much more convenient to work with than strings that contain times.

In [None]:
import datetime

In [None]:
datetime.datetime.now()

In [None]:
datetime.datetime.now() + datetime.timedelta(days=3, hours=5)

Unix timestamps count the number of seconds since midnight UTC on January 1st, 1970.

In [None]:
datetime.datetime.now().timestamp()
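As a quick check of that definition, here is a minimal sketch that converts between Unix timestamps and `datetime` objects. It pins the timezone to UTC explicitly, which is a choice made here for reproducibility; the cell above uses the local clock instead.

```python
import datetime

# Timestamp 0 corresponds to the Unix epoch: midnight UTC on January 1st, 1970.
epoch = datetime.datetime.fromtimestamp(0, tz=datetime.timezone.utc)
print(epoch)

# Going the other way: one day after the epoch is 86,400 seconds.
one_day_later = epoch + datetime.timedelta(days=1)
seconds = one_day_later.timestamp()
print(seconds)
```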

### Times in `pandas`

- `pd.Timestamp` is the `pandas` equivalent of `datetime`.
- `pd.to_datetime` converts strings to `pd.Timestamp` objects.

In [None]:
pd.Timestamp(year=1998, month=11, day=26)

In [None]:
final_start = pd.to_datetime('14th June, 2023, 11:30AM')
final_start

In [None]:
final_finish = pd.to_datetime('June 14th, 2023, 2:30PM')
final_finish
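`pd.to_datetime` infers the format of most date strings, but some strings are ambiguous. The sketch below uses a made-up string to show how the optional `format` argument removes the ambiguity.

```python
import pandas as pd

# '06/07/2023' is ambiguous: June 7th (month first) or July 6th (day first)?
ambiguous = '06/07/2023'

month_first = pd.to_datetime(ambiguous, format='%m/%d/%Y')
day_first = pd.to_datetime(ambiguous, format='%d/%m/%Y')

print(month_first)  # Parsed as June 7th, 2023.
print(day_first)    # Parsed as July 6th, 2023.
```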

Timestamps have time-related attributes, e.g. `dayofweek`, `hour`, `minute`, `second`.

In [None]:
# 0 is Monday, 1 is Tuesday, etc.
final_finish.dayofweek

In [None]:
final_finish.hour

Subtracting timestamps yields `pd.Timedelta` objects.

In [None]:
final_finish - final_start
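A `pd.Timedelta` can be converted to a plain number when needed. The following is a small sketch that rebuilds the two exam times by hand (rather than reusing the variables above) so that it runs on its own.

```python
import pandas as pd

start = pd.Timestamp(year=2023, month=6, day=14, hour=11, minute=30)
finish = pd.Timestamp(year=2023, month=6, day=14, hour=14, minute=30)

duration = finish - start
print(duration)

# Express the duration in seconds, and in hours by dividing by a one-hour Timedelta.
n_seconds = duration.total_seconds()
n_hours = duration / pd.Timedelta(hours=1)
print(n_seconds, n_hours)
```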

### Example: Exam speeds

Below, we have the Final Exam starting and ending times for two sections of a course.

In [None]:
exam_times = pd.read_csv(os.path.join('data', 'exam-times.csv'))
exam_times

**Question:** Who took the longest time to finish the exam?

In [None]:
# Step 1: Convert the time columns to timestamps, using pd.to_datetime.
exam_times['start_exam'] = pd.to_datetime(exam_times['start_exam'])
exam_times['finish_exam'] = pd.to_datetime(exam_times['finish_exam'])
exam_times

In [None]:
# Note that datetime64[ns] is the data type pandas uses to store timestamps in a Series/DataFrame.
exam_times.dtypes

In [None]:
# Step 2: Find the difference between the two time columns.
exam_times['difference'] = exam_times['finish_exam'] - exam_times['start_exam']
exam_times

In [None]:
exam_times.dtypes

In [None]:
# Step 3: Sort by the difference in descending order and take the first row.
exam_times.sort_values('difference', ascending=False)['name'].iloc[0]
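Sorting is not the only way to answer the question. The sketch below uses `idxmax` instead, on a small made-up DataFrame named `toy_times`; its names and times are hypothetical stand-ins, not the contents of `exam-times.csv`.

```python
import pandas as pd

# Hypothetical stand-in for exam_times; the values are invented for illustration.
toy_times = pd.DataFrame({
    'name': ['Annie', 'Billy', 'Sally'],
    'start_exam': ['2023-06-14 11:30', '2023-06-14 11:45', '2023-06-14 11:32'],
    'finish_exam': ['2023-06-14 13:00', '2023-06-14 14:30', '2023-06-14 12:15'],
})

toy_times['start_exam'] = pd.to_datetime(toy_times['start_exam'])
toy_times['finish_exam'] = pd.to_datetime(toy_times['finish_exam'])
toy_times['difference'] = toy_times['finish_exam'] - toy_times['start_exam']

# idxmax returns the index label of the largest difference, so no sort is needed.
slowest = toy_times.loc[toy_times['difference'].idxmax(), 'name']
print(slowest)

# The .dt accessor applies Timedelta methods to every element of a Series.
minutes = toy_times['difference'].dt.total_seconds() / 60
print(minutes.tolist())
```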

## Concatenating vertically

In [None]:
# Run this cell to set up the next example.

section_A = pd.DataFrame({
    'Name': ['Annie', 'Billy', 'Sally', 'Tommy'],
    'Midterm': [98, 82, 23, 45],
    'Final': [88, 100, 99, 67]
})

section_B = pd.DataFrame({
    'Name': ['Junior', 'Rex', 'Flash'],
    'Midterm': [70, 99, 81],
    'Final': [42, 25, 90]
})

section_C = pd.DataFrame({
    'Name': ['Justin', 'Marina'],
    'Final': [98, 52]
})

section_D = pd.DataFrame({
    'Midterm': [10, 30, 80],
    'Name': ['Janine', 'Sooh', 'Suraj']
})

### Example: Grades

Consider the students from our previous example. Suppose their grades are given to us in separate DataFrames. Note that these DataFrames contain the same attributes, but for different individuals.

In [None]:
section_A

In [None]:
section_B

**Question**: How do we combine both DataFrames into a single, larger DataFrame?

### Concatenating vertically

<center><img src="imgs/merging_append3.png" width="30%"></center>




* The `pd.concat` function combines DataFrame and Series objects.
* By default, the **rows of objects are stacked on top of one another**.
* `pd.concat` has many options; we'll learn some of them here, and you'll discover the others by reading the documentation.

### Example: Grades

By default, `pd.concat` takes a list of DataFrames and stacks them row-wise, i.e. on top of one another.

In [None]:
section_A

In [None]:
section_B

In [None]:
pd.concat([section_A, section_B])

Note that the index values above repeat. Setting the optional argument `ignore_index` to `True` replaces the index with 0, 1, 2, ... (calling `.reset_index(drop=True)` on the result would do the same).

In [None]:
pd.concat([section_A, section_B], ignore_index=True)

To keep track of which original DataFrame each row came from, we can use the `keys` optional argument, though if we do this, the resulting DataFrame has a `MultiIndex`.

In [None]:
combined = pd.concat([section_A, section_B], keys=['Section A', 'Section B'])
combined

In [None]:
combined.loc['Section A']
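If a `MultiIndex` is more than you need, one alternative is to record the source in an ordinary column before concatenating. This is only a sketch: the column name `'Section'` and the shortened DataFrames are made up for illustration.

```python
import pandas as pd

# Shortened versions of the section DataFrames, redefined so this runs on its own.
section_A = pd.DataFrame({'Name': ['Annie', 'Billy'], 'Midterm': [98, 82]})
section_B = pd.DataFrame({'Name': ['Junior', 'Rex'], 'Midterm': [70, 99]})

# Tag each DataFrame with its section, then stack them with a fresh index.
labeled = pd.concat(
    [section_A.assign(Section='A'), section_B.assign(Section='B')],
    ignore_index=True,
)
print(labeled)
```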

### Adding a single row

To add a single row to a DataFrame, create a new DataFrame that contains the single row, and use `pd.concat`.

*The DataFrame `append` method was deprecated in `pandas` 1.4 and removed in `pandas` 2.0, so use `pd.concat` instead.*

In [None]:
new_row_data = {'Name': 'King Triton', 'Midterm': 21, 'Final': 94}
new_row_df = pd.DataFrame([new_row_data]) # Note the list!
new_row_df

In [None]:
pd.concat([section_A, new_row_df])

### Missing columns?

If we concatenate two DataFrames that don't share the same column names, `NaN`s are added in the columns that aren't shared.

In [None]:
section_C

In [None]:
section_D

In [None]:
# Note that the 'Name' columns were combined, despite not being in the same position!
pd.concat([section_C, section_D])
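If you would rather drop the unshared columns than fill them with `NaN`s, `pd.concat` accepts `join='inner'`. A minimal sketch, with the two DataFrames redefined so that it runs on its own:

```python
import pandas as pd

section_C = pd.DataFrame({'Name': ['Justin', 'Marina'], 'Final': [98, 52]})
section_D = pd.DataFrame({'Midterm': [10, 30, 80], 'Name': ['Janine', 'Sooh', 'Suraj']})

# join='inner' keeps only the columns that appear in every input DataFrame.
shared_only = pd.concat([section_C, section_D], join='inner', ignore_index=True)
print(shared_only)
```

Here only `'Name'` survives, since it is the only column the two DataFrames share.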

### ⚠️ Warning: No loops!

- `pd.concat` returns a copy; it does not modify any of the input DataFrames.
- Do **not** use `pd.concat` in a loop, as it has terrible time and space efficiency: each call copies every row accumulated so far.

```py
total = pd.DataFrame()
for df in dataframes:
    total = pd.concat([total, df])  # Don't do this!
```

- Instead, use `pd.concat(dataframes)`, where `dataframes` is a list of DataFrames.

### Aside: Accessing file names programmatically

- At times, you'll need to load in all of the files in a given folder.
- `os.listdir(dirname)` returns a **list** of the names of the files in the folder `dirname`.

In [None]:
import os
os.listdir('data')

In [None]:
os.listdir('../')

The following does something similar, but in the shell.

In [None]:
!ls ../
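Putting the last few ideas together, here is one possible sketch of loading a folder with one `.csv` file per day. The folder, file names, and `'sales'` column are all made up, and the files are written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

import pandas as pd

# Write three tiny hypothetical daily files to a temporary folder.
folder = tempfile.mkdtemp()
for day in ['2023-01-01', '2023-01-02', '2023-01-03']:
    pd.DataFrame({'sales': [1, 2]}).to_csv(os.path.join(folder, f'{day}.csv'), index=False)

# Collect the DataFrames in a list, then call pd.concat exactly once.
pieces = []
for fname in sorted(os.listdir(folder)):
    if fname.endswith('.csv'):
        path = os.path.join(folder, fname)
        # Record which day each row came from, parsed from the file name.
        date = pd.to_datetime(fname[:-len('.csv')])
        pieces.append(pd.read_csv(path).assign(date=date))

all_days = pd.concat(pieces, ignore_index=True)
print(all_days)
```

The loop here only appends to a Python list, which is cheap; the single `pd.concat` call at the end does all of the copying.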

## Concatenating horizontally

In [None]:
# Run this cell to set up the next example.

exams = section_A.copy()

assignments = exams[['Name']].assign(Homeworks=[99, 45, 23, 81],
                                     Labs=[100, 100, 99, 100])

overall = pd.DataFrame({
    'PID': ['A15253545', 'A10348245', 'A13349069', 'A18485824', 'A10094857'],
    'Student': ['Billy', 'Sally', 'Annie', 'Larry', 'Johnny'],
    'Final': [88, 64, 91, 45, 89]
})

### Example: Grades (again)

Suppose we have two DataFrames, `exams` and `assignments`, which both contain different attributes for the same individuals.

In [None]:
exams

In [None]:
assignments

If we try to combine these DataFrames with `pd.concat`, we don't quite get what we're looking for.

In [None]:
pd.concat([exams, assignments])

But that's where the `axis` argument becomes handy. 

Remember, most `pandas` operations default to `axis=0`, but here we want to concatenate the columns of `exams` to the columns of `assignments`, so we should use `axis=1`.

In [None]:
pd.concat([exams, assignments], axis=1)

Note that the `'Name'` column appears twice!

### Concatenating horizontally

<center><img src='imgs/merging_concat_series_ignore_index.png' width='50%'></center>

- To concatenate two DataFrames horizontally, use `pd.concat` with `axis=1`.
- **Concatenation is done by matching indexes, regardless of their order.** It does not look at the information in any of the columns!

Note that the call to `pd.concat` below combines information about each individual correctly, even though the orders of the names in `exams_by_name` and `assignments_by_name` are different.

In [None]:
# .iloc[::-1] reverses the rows of the DataFrame.
exams_by_name = exams.set_index('Name').iloc[::-1]
exams_by_name

In [None]:
assignments_by_name = assignments.set_index('Name')
assignments_by_name

In [None]:
pd.concat([exams_by_name, assignments_by_name], axis=1)

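Remember that with `axis=1`, `pd.concat` aligns rows using only the index, not any of the columns. Below, both DataFrames have the index 0 through 3, so rows are paired by position and the two `'Name'` columns disagree.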

In [None]:
exams_reversed = exams.iloc[::-1].reset_index(drop=True)
exams_reversed

In [None]:
assignments

In [None]:
pd.concat([exams_reversed, assignments], axis=1)

### Summary: `pd.concat`

- `pd.concat` "stitches" two or more DataFrames together.
- If you use `axis=0`, the DataFrames are concatenated **vertically** based on column names (rows on top of rows).
- If you use `axis=1`, the DataFrames are concatenated **horizontally** based on row indexes (columns next to columns).

## Merging

### Joining

- `pd.concat` with `axis=1` combines DataFrames horizontally.
- To combine the rows of DataFrames in more advanced ways, we perform a **join** (SQL term), i.e. a **merge** (`pandas` term).
- A join is appropriate when we have two sources of information **about the same individuals** that are **linked by a common column**.
- The common column is called the **join key**.

In [None]:
# Run these two cells to set up the next example.

temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto', 'Rome'],
    'Temperature': [76, 28, 56]
})

countries = pd.DataFrame({
    'City': ['Toronto', 'Shanghai', 'San Diego'],
    'Country': ['Canada', 'China', 'USA']
})

In [None]:
%reload_ext pandas_tutor

Let's work with a small example.

In [None]:
temps

In [None]:
countries

In [None]:
temps_city = temps.set_index('City')
temps_city

In [None]:
countries_city = countries.set_index('City')
countries_city

In [None]:
pd.concat([temps_city, countries_city],axis=1)

We'd like to combine both DataFrames, but it's not immediately clear if `pd.concat` would be useful.

It turns out that the right tool to use is the `merge` method.

In [None]:
%%pt

temps.merge(countries)

### The `merge` method

- The `merge` DataFrame method joins two tables by columns or indexes.
    - "Merge" is just `pandas`' word for "join".

- When using the `merge` method, the DataFrame before `.merge` is the "left" DataFrame, and the DataFrame passed into `.merge` is the "right" DataFrame.
    - In `temps.merge(countries)`, `temps` is considered the "left" DataFrame and `countries` is the "right" DataFrame; the columns from the left DataFrame appear to the left of the columns from the right DataFrame.


- By default:
    - If join keys are not specified, all shared columns between the two DataFrames are used.
    - The "type" of join performed is an inner join.
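To make those defaults concrete, the sketch below spells them out with the `on` and `how` arguments and checks that the result matches the bare `temps.merge(countries)` call. The two DataFrames are redefined so that it runs on its own.

```python
import pandas as pd

temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto', 'Rome'],
    'Temperature': [76, 28, 56]
})
countries = pd.DataFrame({
    'City': ['Toronto', 'Shanghai', 'San Diego'],
    'Country': ['Canada', 'China', 'USA']
})

# 'City' is the only shared column, so it is the join key by default.
implicit = temps.merge(countries)
explicit = temps.merge(countries, on='City', how='inner')

print(explicit)
print(implicit.equals(explicit))
```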

### Join types: inner joins

- Note that `'Rome'` and `'Shanghai'` do not appear in the merged DataFrame.
- This is because there is:
    - no city named `'Rome'` in the right DataFrame, and
    - no city named `'Shanghai'` in the left DataFrame.
- The default type of join that `merge` performs is an **inner join**, which keeps the **intersection** of the join keys.


<center><img src='imgs/image_0.png' width=20%></center>

### Different join types

We can change the type of join performed by changing the `how` argument in `merge`. Let's experiment!

In [None]:
temps

In [None]:
countries

In [None]:
# The default value of how is 'inner'.
temps.merge(countries, how='inner')

In [None]:
# Note the NaNs!
temps.merge(countries, how='left')

In [None]:
temps.merge(countries, how='right')

In [None]:
%%pt

temps.merge(countries, how='outer')
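To see which side each row of an outer join came from, `merge` has an optional `indicator` argument that adds a `'_merge'` column. A small sketch, with the DataFrames redefined so that it runs on its own:

```python
import pandas as pd

temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto', 'Rome'],
    'Temperature': [76, 28, 56]
})
countries = pd.DataFrame({
    'City': ['Toronto', 'Shanghai', 'San Diego'],
    'Country': ['Canada', 'China', 'USA']
})

# '_merge' is 'both', 'left_only', or 'right_only' for each row.
merged = temps.merge(countries, how='outer', indicator=True)
print(merged[['City', '_merge']])

# Collect the source of each city in a plain dictionary.
sources = dict(zip(merged['City'], merged['_merge'].astype(str)))
```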

Note that, when neither DataFrame has duplicated keys, an outer join on the index is what `pd.concat` performs by default with `axis=1`.

In [None]:
pd.concat([temps.set_index('City'), countries.set_index('City')], axis=1)

### Different join types handle mismatches differently

There are four types of joins.

* **Inner:** keep **only** matching keys (intersection).
* **Outer:** keep **all** keys in both DataFrames (union).
* **Left:** keep all keys in the left DataFrame, whether or not they are in the right DataFrame.
* **Right:** keep all keys in the right DataFrame, whether or not they are in the left DataFrame.

<center><img src='imgs/image_1.png' width=30%></center>

### Symmetry

Note that `a.merge(b, how='left')` contains the same information as `b.merge(a, how='right')`, just in a different order.

In [None]:
temps.merge(countries, how='left')

In [None]:
countries.merge(temps, how='right')
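One way to check the symmetry claim is to put the columns of one result in the order of the other and compare. This is a sketch on the same small example, redefined so that it runs on its own.

```python
import pandas as pd

temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto', 'Rome'],
    'Temperature': [76, 28, 56]
})
countries = pd.DataFrame({
    'City': ['Toronto', 'Shanghai', 'San Diego'],
    'Country': ['Canada', 'China', 'USA']
})

left = temps.merge(countries, how='left')
right = countries.merge(temps, how='right')

# Reorder the columns of `right` to match `left`, and sort rows by city to be safe.
left_sorted = left.sort_values('City').reset_index(drop=True)
right_sorted = right[left.columns].sort_values('City').reset_index(drop=True)

same_information = left_sorted.equals(right_sorted)
print(same_information)
```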

### Specifying join keys

- `pandas` defaults to using all shared column names as join keys.
- If there are multiple shared column names and you only want to join on one of them, **or** if there are no shared column names, then you will need to specify which columns to join on.
- Two solutions:
    1. Use the `on` argument if the desired columns have the same names in both DataFrames.
    2. Use the `left_on` or `left_index` argument AND the `right_on` or `right_index` argument.
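As an illustration of the second option, the sketch below joins the index of one DataFrame to a column of another using `left_index` and `right_on`. It reuses the cities example rather than the grades data, and redefines the DataFrames so that it runs on its own.

```python
import pandas as pd

temps = pd.DataFrame({
    'City': ['San Diego', 'Toronto', 'Rome'],
    'Temperature': [76, 28, 56]
})
countries = pd.DataFrame({
    'City': ['Toronto', 'Shanghai', 'San Diego'],
    'Country': ['Canada', 'China', 'USA']
})

# The city names are in the index of the left DataFrame...
temps_city = temps.set_index('City')

# ...and in the 'City' column of the right DataFrame.
merged = temps_city.merge(countries, left_index=True, right_on='City')
print(merged)
```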

In [None]:
exams

In [None]:
overall

This is not what we're looking for:

In [None]:
exams.merge(overall)

Instead, we need to tell `pandas` to look in the `'Name'` column of `exams` and `'Student'` column of `overall`. 

In [None]:
exams.merge(overall, left_on='Name', right_on='Student')

If there are shared column names in the two DataFrames you are merging **that you are not using as join keys**, `'_x'` and `'_y'` are appended to their names by default.

In [None]:
exams.merge(overall, left_on='Name', right_on='Student', suffixes=('_Exam', '_Overall'))

## Many-to-one & many-to-many joins

### One-to-one joins

- So far in this lecture, the joins we have worked with are called **one-to-one** joins.
- Neither the left DataFrame nor the right DataFrame contained any duplicates in the join key.
- What if there are duplicated join keys, in one or both of the DataFrames we are merging?

In [None]:
# Run this cell to set up the next example.

profs = pd.DataFrame(
[['Brad', 'UCB', 9],
 ['Janine', 'UCSD', 8],
 ['Marina', 'UIC', 7],
 ['Justin', 'OSU', 5],
 ['Soohyun', 'UCSD', 2],
 ['Suraj', 'UCB', 2]],
    columns=['Name', 'School', 'Years']
)

schools = pd.DataFrame({
    'Abr': ['UCSD', 'UCLA', 'UCB', 'UIC'],
    'Full': ['University of California, San Diego', 'University of California, Los Angeles', 'University of California, Berkeley', 'University of Illinois Chicago']
})

programs = pd.DataFrame({
    'uni': ['UCSD', 'UCSD', 'UCSD', 'UCB', 'OSU', 'OSU'],
    'dept': ['Math', 'HDSI', 'COGS', 'CS', 'Math', 'CS'],
    'grad_students': [205, 54, 281, 439, 304, 193]
})

### Many-to-one joins

- Many-to-one joins are joins where **one** of the DataFrames contains duplicate values in the join key. 
- The resulting DataFrame will preserve those duplicate entries as appropriate. 

In [None]:
profs

In [None]:
schools

Note that when merging `profs` and `schools`, the information from `schools` is duplicated.
- `'University of California, San Diego'` appears twice.
- `'University of California, Berkeley'` also appears twice.

In [None]:
# Why is a left merge most appropriate here?
profs.merge(schools, left_on='School', right_on='Abr', how='left')
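If you expect a join to be many-to-one, you can ask `merge` to verify that with its optional `validate` argument. The following sketch uses shortened, partly made-up versions of `profs` and `schools` so that it runs on its own.

```python
import pandas as pd

profs = pd.DataFrame({
    'Name': ['Brad', 'Janine', 'Suraj'],
    'School': ['UCB', 'UCSD', 'UCB']
})
schools = pd.DataFrame({
    'Abr': ['UCSD', 'UCB'],
    'Full': ['University of California, San Diego', 'University of California, Berkeley']
})

# 'm:1' checks that the right join key ('Abr') has no duplicates.
ok = profs.merge(schools, left_on='School', right_on='Abr', how='left', validate='m:1')
print(ok)

# '1:1' also requires unique left keys, which fails because 'UCB' repeats in profs.
try:
    profs.merge(schools, left_on='School', right_on='Abr', how='left', validate='1:1')
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print(err)
```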

### Many-to-many joins

Many-to-many joins are joins where both DataFrames have duplicate values in the join key.

In [None]:
profs

In [None]:
programs

Before running the following cell, try predicting the number of rows in the output.

In [None]:
%%pt

profs.merge(programs, left_on='School', right_on='uni')

- `merge` stitched together every UCSD row in `profs` with every UCSD row in `programs`. 
- Since there were 2 UCSD rows in `profs` and 3 in `programs`, there are $2 \cdot 3 = 6$ UCSD rows in the output. The same applies for all other schools.
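The same counting argument predicts the total number of rows: multiply the number of occurrences of each key on the two sides and add up the products. A sketch of that calculation, with `profs` and `programs` redefined (minus the columns not needed here) so that it runs on its own:

```python
import pandas as pd

profs = pd.DataFrame({
    'Name': ['Brad', 'Janine', 'Marina', 'Justin', 'Soohyun', 'Suraj'],
    'School': ['UCB', 'UCSD', 'UIC', 'OSU', 'UCSD', 'UCB']
})
programs = pd.DataFrame({
    'uni': ['UCSD', 'UCSD', 'UCSD', 'UCB', 'OSU', 'OSU'],
    'dept': ['Math', 'HDSI', 'COGS', 'CS', 'Math', 'CS']
})

left_counts = profs['School'].value_counts()
right_counts = programs['uni'].value_counts()

# Multiply the counts key by key; a key missing from one side contributes 0 rows.
rows_per_key = left_counts.mul(right_counts, fill_value=0)
predicted = int(rows_per_key.sum())

actual = len(profs.merge(programs, left_on='School', right_on='uni'))
print(predicted, actual)
```

UIC contributes no rows because it has no match in `programs` and the default join is an inner join.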

## Summary, next time

### Summary

- Times in `pandas` are represented by `pd.Timestamp` objects (moments in time) and `pd.Timedelta` objects (durations).
- `pd.concat` "stitches" two or more DataFrames together, either vertically or horizontally.
    - Vertically: looks at column names. Horizontally: looks at row indexes.
- The `merge` DataFrame method **joins** two DataFrames together based on a shared column, called a join key. There are four types of joins:
    - Inner join: keeps the **intersection** of the join keys.
    - Outer join: keeps the **union** of the join keys.
    - Left/right joins: keeps all of the join keys in the left/right DataFrame.
    - In outer/left/right joins, all missing fields are filled with `NaN`s.

### Next time

Cleaning messy, real-world data.