# Merge CSV files based on a shared column

Purpose: This script combines two CSV files into one by matching rows on a shared key column.

## 1. Import modules

In [7]:
import pandas as pd

## 2. Read the two CSV files and convert them to pandas DataFrames

In [9]:
df1 = pd.read_csv('sample-csvs/metadata-missing-imageLinks.csv')
df2 = pd.read_csv('sample-csvs/imageLinks.csv')

Let's take a look at the first CSV.

In [3]:
df1.head(3)

Unnamed: 0,Title,Manifest
0,Basement structures in Ohio,https://library.osu.edu/dc/concern/generic_wor...
1,"Bedrock geology of the Flint Ridge area, Licki...",https://library.osu.edu/dc/concern/generic_wor...
2,Bouguer anomalies in Ohio,https://library.osu.edu/dc/concern/generic_wor...


The columns are being truncated and are hard to read. The following code fixes that:

In [12]:
pd.set_option('display.max_colwidth', None) 
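`pd.set_option` changes the display width for the rest of the session. If you only want full-width columns for a single inspection, `pd.option_context` applies the setting temporarily. The sketch below uses a made-up URL (not one from the sample files) and assumes pandas' default column width limit is still in effect outside the `with` block.

```python
import pandas as pd

# Hypothetical long value, standing in for a manifest URL.
long_url = "https://example.org/" + "x" * 80
df = pd.DataFrame({"Manifest": [long_url]})

# Inside the block, cell contents are shown in full.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(df)

# Outside the block, the previous setting is restored and long cells are truncated.
short_repr = repr(df)

print(long_url in full_repr)
print(long_url in short_repr)
```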

Let's take a look at the CSVs again. Both files have a column called `Manifest`, but their rows are not in the same order.

In [15]:
df1.head(3)

Unnamed: 0,Title,Manifest
0,Basement structures in Ohio,https://library.osu.edu/dc/concern/generic_works/hq37w186h/manifest.json
1,"Bedrock geology of the Flint Ridge area, Licking and Muskingum Counties, Ohio",https://library.osu.edu/dc/concern/generic_works/05742538x/manifest.json
2,Bouguer anomalies in Ohio,https://library.osu.edu/dc/concern/generic_works/cc08hv396/manifest.json


In [17]:
df2.head(3)

Unnamed: 0,Manifest,Image
0,https://library.osu.edu/dc/concern/generic_works/cc08hv396/manifest.json,https://library.osu.edu/dc/downloads/1257b6246?file=thumbnail
1,https://library.osu.edu/dc/concern/generic_works/05742538x/manifest.json,https://library.osu.edu/dc/downloads/6w924r993?file=thumbnail
2,https://library.osu.edu/dc/concern/generic_works/hq37w186h/manifest.json,https://library.osu.edu/dc/downloads/zc77t401n?file=thumbnail
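Before merging, it can help to confirm that the key column really holds the same values in both files and that no key is repeated. The following is a minimal sketch using small made-up DataFrames; the names `left` and `right` and the `id-…` values are placeholders, not data from the sample CSVs.

```python
import pandas as pd

# Hypothetical stand-ins for the two files; only the key column is shown.
left = pd.DataFrame({"Manifest": ["id-1", "id-2", "id-3"]})
right = pd.DataFrame({"Manifest": ["id-3", "id-2", "id-1"]})

# Same set of keys, regardless of row order?
same_keys = set(left["Manifest"]) == set(right["Manifest"])

# Same row order? (Not required for a merge, shown only for comparison.)
same_order = left["Manifest"].tolist() == right["Manifest"].tolist()

# Repeated keys would make the merge produce extra rows.
has_duplicates = bool(
    left["Manifest"].duplicated().any() or right["Manifest"].duplicated().any()
)

print(same_keys, same_order, has_duplicates)
```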


## 3. Merge the files

We want a single CSV that has all three columns: `Title`, `Manifest`, and `Image`. For this task, we can call the pandas function `merge` and specify the column `Manifest` as the matching field. Column names are case-sensitive, so the name must match the header exactly.

In [19]:
# Merge the two dataframes based on the "Manifest" column
merged_df = pd.merge(df1, df2, on='Manifest')

In [8]:
merged_df.head(3)

Unnamed: 0,Title,Manifest,Image
0,Basement structures in Ohio,https://library.osu.edu/dc/concern/generic_works/hq37w186h/manifest.json,https://library.osu.edu/dc/downloads/zc77t401n?file=thumbnail
1,"Bedrock geology of the Flint Ridge area, Licking and Muskingum Counties, Ohio",https://library.osu.edu/dc/concern/generic_works/05742538x/manifest.json,https://library.osu.edu/dc/downloads/6w924r993?file=thumbnail
2,Bouguer anomalies in Ohio,https://library.osu.edu/dc/concern/generic_works/cc08hv396/manifest.json,https://library.osu.edu/dc/downloads/1257b6246?file=thumbnail
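By default, `pd.merge` performs an inner join, so rows whose key appears in only one file are dropped without warning. The sketch below, built on made-up data rather than the sample CSVs, shows one way to spot such rows: a left join with `indicator=True`, which adds a `_merge` column describing where each row came from. The frame names `titles` and `images` are illustrative.

```python
import pandas as pd

# Hypothetical data: "id-2" has a title but no image link.
titles = pd.DataFrame({
    "Title": ["Map A", "Map B", "Map C"],
    "Manifest": ["id-1", "id-2", "id-3"],
})
images = pd.DataFrame({
    "Manifest": ["id-3", "id-1"],  # different row order, one key missing
    "Image": ["img-3", "img-1"],
})

# Inner join (the default): only keys present in both frames are kept.
inner = pd.merge(titles, images, on="Manifest")

# Left join keeps every row of `titles`; `_merge` flags rows with no match.
checked = pd.merge(titles, images, on="Manifest", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(inner)
print(unmatched[["Title", "Manifest"]])
```

If `unmatched` is empty, every row in the first file found a partner and the inner join lost nothing.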


In [10]:
# Write the merged dataframe to a new CSV file
merged_df.to_csv('merged_file.csv', index=False)

Look for `merged_file.csv` in the current working directory and inspect it in a text or spreadsheet editor.
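The file can also be checked from Python by reading it back in. This is a rough sketch with made-up values, and it writes to a temporary directory instead of the notebook's folder so that it can run on its own. Passing `index=False` to `to_csv` keeps the row index out of the file, so the columns read back are exactly the ones that were written.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical merged result with the three expected columns.
merged = pd.DataFrame({
    "Title": ["Map A", "Map C"],
    "Manifest": ["id-1", "id-3"],
    "Image": ["img-1", "img-3"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "merged_file.csv"
    merged.to_csv(out_path, index=False)  # index=False: no extra index column
    round_trip = pd.read_csv(out_path)

print(list(round_trip.columns))
print(round_trip.shape)
```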