# Data Linkage Checkpoint Answers

**Tian Lou** \
Ohio Education Research Center \
The Ohio State University

**Xiangyu Ren** \
New York University

**Anna-Carolina Haensch** \
University of Maryland \
LMU Munich

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10256875.svg)](https://doi.org/10.5281/zenodo.10256875)

**This notebook is developed for the [Data Literacy and Evidence Building Executive Class](https://www.socialdatascience.umd.edu/data-literacy).**

**The "Syntucky" data, which is synthetic in nature, is exclusively designed for training exercises. It is not intended to derive meaningful insights or make determinations about real-world populations.**

In [None]:
# Load libraries

# Interface for connecting to a MySQL database server from Python
import MySQLdb 

# Library that provides lightweight disk-based database
import sqlite3 

# File system path
from pathlib import Path

# Data manipulation and analysis tool
import pandas as pd 

Before running the code below, please change <font color='red'> **YOUR DATA DIRECTORY**</font> to your own file path.

In [None]:
#Define data directory (Path joins file names correctly with or without a trailing slash)
data_directory = Path('YOUR DATA DIRECTORY')

#master data
master_df = pd.read_csv(data_directory / 'master_crosssection.csv')

#employment data
employment_df = pd.read_csv(data_directory / 'employment_crosssection.csv')

#education data
education_df = pd.read_csv(data_directory / 'education_crosssection.csv')

Before running the code below, please change <font color='red'> **YOUR USERNAME**</font> to your username or use your own file path.

In [None]:
# Create the database file in your personal folder if it does not exist yet
Path('C:/Users/YOUR USERNAME/syn_data.db').touch() 

# Establish database connection
conn = sqlite3.connect('C:/Users/YOUR USERNAME/syn_data.db') 
c = conn.cursor()

In [None]:
#Remove the table if it already exists
c.execute('''DROP TABLE IF EXISTS master ''')

#Create an empty table, "master", in your database
#In the code below, we define the column names and types before uploading the master data to the database
c.execute('''CREATE TABLE master (id int, 
                                  gender text, 
                                  birth_year int, 
                                  birth_month text,
                                  urm_status text,
                                  race_group text,
                                  instate_origin  int)''')

In [None]:
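#Load the data in master_df to the database
#Note: if_exists = 'replace' drops the existing "master" table and recreates it from master_df,
#so the stored column types are the ones pandas infers from the DataFrame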
master_df.to_sql('master', conn, if_exists = 'replace', index = False)

In [None]:
#Remove the table if it already exists
c.execute('''DROP TABLE IF EXISTS employment ''')

#Create the employment table in the database
c.execute('''CREATE TABLE employment (id int,
                                      year7_max_qtrs_one_employer int, 
                                      year7_education_industry_employed int, 
                                      year7_ct_qtrs_employed int,
                                      year7_ct_employers int, 
                                      year7_earnings int, 
                                      year7_earnings_most_consistent_employer int)''')

In [None]:
#Load employment data to the database
employment_df.to_sql('employment', conn, if_exists = 'replace', index = False)

In [None]:
#Remove the table if it already exists
c.execute('''DROP TABLE IF EXISTS education ''')

#Create the education table in the database
c.execute('''CREATE TABLE education (id int,
                                     first_enroll text,
                                     first_enroll_term text,
                                     first_enroll_calendaryear int,
                                     high_completion_acadyr int,
                                     high_completion_label text,
                                     high_completion text,
                                     year7_enrolled int)''')

In [None]:
#Load the education data to the database
education_df.to_sql('education',conn, if_exists = 'replace',index = False)
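
The cells above create each table with explicit column types and then load the data with `if_exists = 'replace'`. The sketch below illustrates how that option interacts with a table that already exists. It uses an in-memory SQLite database and a small made-up DataFrame; `toy_df`, `toy_master`, `schema_conn`, and `column_types` are hypothetical names used only for this illustration and are not part of the Syntucky data.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data standing in for master_df
toy_df = pd.DataFrame({'id': [1, 2, 3], 'gender': ['F', 'M', 'F']})

# In-memory database, so nothing is written to disk
schema_conn = sqlite3.connect(':memory:')

def column_types(connection, table):
    """Return {column name: declared type} for a SQLite table."""
    rows = connection.execute(f'PRAGMA table_info({table})').fetchall()
    return {row[1]: row[2] for row in rows}

# Declare the table by hand, as in the cells above
schema_conn.execute('CREATE TABLE toy_master (id int, gender text)')
declared_types = column_types(schema_conn, 'toy_master')

# 'replace' drops the hand-made table and lets pandas create a new one
toy_df.to_sql('toy_master', schema_conn, if_exists='replace', index=False)
replaced_types = column_types(schema_conn, 'toy_master')

# 'append' keeps the hand-made table and only inserts the rows
schema_conn.execute('DROP TABLE toy_master')
schema_conn.execute('CREATE TABLE toy_master (id int, gender text)')
toy_df.to_sql('toy_master', schema_conn, if_exists='append', index=False)
appended_types = column_types(schema_conn, 'toy_master')
appended_rows = schema_conn.execute('SELECT COUNT(*) FROM toy_master').fetchone()[0]

print('Declared by hand:   ', declared_types)
print('After replace:      ', replaced_types)
print('After append:       ', appended_types)
```

In this sketch the `replace` option overwrites the hand-written schema, while `append` preserves it. Note that `append` assumes the DataFrame columns match the declared columns, which is an assumption you would need to verify for your own files.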

#### **Checkpoint 1: Join the Master Data and the Education Data**

Please join the master table with the education table using the SQL commands `LEFT JOIN` and `INNER JOIN`. Compare the number of rows in the two resulting DataFrames.

In [None]:
#Left join master data and education data by using SQL query
master_education_left_df = pd.read_sql('''SELECT * FROM master m 
                                          LEFT JOIN education e 
                                          ON m.id = e.id''', conn)

#Check number of rows
print('By left joining the master data with the education data, the final DataFrame has', 
      master_education_left_df.shape[0], 'rows.')

In [None]:
#Inner join master data and education data by using SQL query
master_education_inner_df = pd.read_sql('''SELECT * FROM master m 
                                           INNER JOIN education e 
                                           ON m.id = e.id''', conn)

#Check number of rows
print('By inner joining the master data with the education data, the final DataFrame has', 
      master_education_inner_df.shape[0], 'rows.')
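
To see why the two row counts can differ, here is a minimal sketch on made-up data, assuming an in-memory SQLite database. The tables `toy_master` and `toy_education` and their values are hypothetical: four people appear in the toy master table, but only two of them have an education record.

```python
import sqlite3
import pandas as pd

# In-memory database with two small hypothetical tables
join_conn = sqlite3.connect(':memory:')

pd.DataFrame({'id': [1, 2, 3, 4],
              'gender': ['F', 'M', 'F', 'M']}).to_sql('toy_master', join_conn, index=False)

pd.DataFrame({'id': [2, 3],
              'year7_enrolled': [1, 0]}).to_sql('toy_education', join_conn, index=False)

# LEFT JOIN keeps every row of the left table (toy_master)
toy_left_df = pd.read_sql('''SELECT m.id, m.gender, e.year7_enrolled
                             FROM toy_master m
                             LEFT JOIN toy_education e
                             ON m.id = e.id''', join_conn)

# INNER JOIN keeps only the ids found in both tables
toy_inner_df = pd.read_sql('''SELECT m.id, m.gender, e.year7_enrolled
                              FROM toy_master m
                              INNER JOIN toy_education e
                              ON m.id = e.id''', join_conn)

# Rows without a match get missing values in the education columns
unmatched_rows = toy_left_df['year7_enrolled'].isna().sum()

print('Left join rows: ', toy_left_df.shape[0])
print('Inner join rows:', toy_inner_df.shape[0])
print('Left join rows without an education record:', unmatched_rows)
```

The difference between the two row counts equals the number of master records with no match in the education table, which is a quick check you can apply to the checkpoint results as well.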

#### **Checkpoint 2: Right Join the Master Data and the Employment Data**

Please right join the master data, `master_df`, with the employment data, `employment_df`, using the pandas `merge()` function. Check how many rows the final DataFrame has.

In [None]:
#Right Join in Python
master_emp_pd_right_df = master_df.merge(employment_df, how = 'right', on = 'id')

#Check number of rows
print('By right joining the master data with the employment data, the final DataFrame has',
      master_emp_pd_right_df.shape[0], 'rows.')
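
As an optional follow-up, the sketch below shows one way to inspect where the rows of a right join come from, using the `indicator` argument of `merge()`. The DataFrames `toy_master_df` and `toy_employment_df` and their values are made up for illustration and do not come from the Syntucky files.

```python
import pandas as pd

# Hypothetical toy data: ids 4 and 5 have employment records but no master record
toy_master_df = pd.DataFrame({'id': [1, 2, 3],
                              'gender': ['F', 'M', 'F']})
toy_employment_df = pd.DataFrame({'id': [2, 3, 4, 5],
                                  'year7_earnings': [30000, 42000, 28000, 51000]})

# A right join keeps every row of the right table (toy_employment_df);
# indicator=True adds a '_merge' column recording the source of each row
toy_right_df = toy_master_df.merge(toy_employment_df,
                                   how='right',
                                   on='id',
                                   indicator=True)

# Count matched rows ('both') and rows found only in the employment data ('right_only')
match_counts = toy_right_df['_merge'].value_counts()

print('Right join rows:', toy_right_df.shape[0])
print(match_counts)
```

Rows labeled `right_only` have missing values in the columns that come from the master data, so counting them shows how many employment records could not be linked.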