## Assignment 5: Toddler Project

In [2]:
import pandas as pd
import numpy as np
from datetime import date

In [3]:
dtravel = pd.read_csv("travelq.csv")
otravel = pd.read_csv("orgdetail.csv")

In [4]:
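# Keep the first 1,000 rows; .copy() makes later column assignments act on an independent frame
dtravel = dtravel.head(1000).copy()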

# Basic Cleaning

## Standardizing columns

This code converts specified numeric columns to numeric type (with errors coerced to NaN) and rounds them to 2 decimal places. Non-numeric columns are converted to strings.

In [8]:
numeric_col = [
    "airfare", "other_transport", "lodging", "meals", "other_expenses", "total"
]
dtravel[numeric_col] = dtravel[numeric_col].apply(pd.to_numeric, errors='coerce').round(2)

non_numeric_col = dtravel.select_dtypes(exclude=['number']).columns
dtravel[non_numeric_col] = dtravel[non_numeric_col].astype(str)
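As a minimal sketch of what the coercion step does, the toy frame below (hypothetical values, not taken from travelq.csv) contains one entry that cannot be parsed as a number. With `errors="coerce"` that entry becomes NaN instead of raising an error, and the remaining values are rounded to 2 decimal places.

```python
import pandas as pd

# Hypothetical toy data: "n/a" stands in for any unparseable entry.
toy = pd.DataFrame({
    "airfare": ["646.17", "n/a", "1104.268"],
    "name": ["A", "B", "C"],
})

# Unparseable strings become NaN; valid ones become floats rounded to 2 places.
toy["airfare"] = pd.to_numeric(toy["airfare"], errors="coerce").round(2)

# Counting the NaNs shows how many values were lost to coercion.
n_bad = int(toy["airfare"].isna().sum())
print(toy["airfare"].tolist(), n_bad)
```

Counting the coerced values right after conversion is a cheap way to notice when a column holds more bad entries than expected.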

This code removes unnecessary columns from the `dtravel` DataFrame.

In [10]:
def remove_columns(dtravel, columns_to_remove):
    dtravel = dtravel.drop(columns=columns_to_remove, errors='ignore')
    print(f"Removed columns: {', '.join(columns_to_remove)}")
    return dtravel

columns_to_remove = [
    'disclosure_group', 'title_fr', 'purpose_fr', 'destination_fr', 
    'additional_comments_en', 'additional_comments_fr'
]

dtravel = remove_columns(dtravel, columns_to_remove)

Removed columns: disclosure_group, title_fr, purpose_fr, destination_fr, additional_comments_en, additional_comments_fr


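This code flags outliers in each numeric column using the IQR method (values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`) and replaces them with NaN rather than dropping the rows.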

In [12]:
def remove_outliers(dtravel, numerical_columns):
    for column in numerical_columns:
        Q1 = dtravel[column].quantile(0.25)
        Q3 = dtravel[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        dtravel.loc[(dtravel[column] < lower_bound) | (dtravel[column] > upper_bound), column] = np.nan
        print(f"Processed Outliers for {column} (Outliers replaced with NaN)")
    
    return dtravel

numerical_columns = dtravel.select_dtypes(include=['number']).columns.tolist()
dtravel = remove_outliers(dtravel, numerical_columns)

Processed Outliers for airfare (Outliers replaced with NaN)
Processed Outliers for other_transport (Outliers replaced with NaN)
Processed Outliers for lodging (Outliers replaced with NaN)
Processed Outliers for meals (Outliers replaced with NaN)
Processed Outliers for other_expenses (Outliers replaced with NaN)
Processed Outliers for total (Outliers replaced with NaN)
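To make the IQR rule concrete, here is a small worked sketch on a hypothetical Series (the values are invented, with 5000.0 planted as the outlier). It uses `Series.mask` in place of the `.loc` assignment used above; the two should be equivalent for this purpose.

```python
import pandas as pd

# Hypothetical expense values; 5000.0 is the planted outlier.
s = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 5000.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 122.5 and 157.5
iqr = q3 - q1                                 # 35.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr # 70.0 and 210.0

# Series.mask replaces values with NaN wherever the condition is True.
cleaned = s.mask((s < lower) | (s > upper))
print(int(cleaned.isna().sum()))  # 1
```

Because flagged values become NaN and the rows stay in place, per-column counts in a later `describe()` can be lower than the number of rows.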


This code cleans the `owner_org_title` column by removing the French part after the '|' character.

In [14]:
def clean_owner_org_title(dtravel):
    # Raw string so the regex escape \| (a literal pipe) is passed through unchanged
    dtravel['owner_org_title'] = dtravel['owner_org_title'].str.split(r' \| ').str[0]
    return dtravel

dtravel = clean_owner_org_title(dtravel)
print("Removed the French from 'owner_org_title'")

Removed the French from 'owner_org_title'
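The split can also be written without any regular expression. The sketch below assumes pandas 1.4 or later, where `Series.str.split` accepts `regex=False`; the two titles are illustrative inputs rather than rows copied from the dataset.

```python
import pandas as pd

# Illustrative bilingual and English-only titles.
titles = pd.Series([
    "Accessibility Standards Canada | Normes d'accessibilité Canada",
    "Agriculture and Agri-Food Canada",  # no separator: left unchanged
])

# regex=False treats " | " as literal text, so the pipe needs no escaping.
english = titles.str.split(" | ", regex=False).str[0]
print(english.tolist())
```

Titles without the separator pass through untouched, because splitting a string that lacks the pattern returns it as the only element.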


# Basic Information

## Understanding the dataset

We will look at each dataset separately before merging them, reporting the following for each:
- first 2 rows of the dataset
- number of rows / number of columns
- summary of the dataset: count, unique values, top value, and frequency for text columns; mean, standard deviation, and quartiles for numeric columns

In [18]:
print("The following information for orgdetail.csv")
print("rows =", otravel.shape[0], ", columns =", otravel.shape[1])
otravel.head(2)

The following information for orgdetail.csv
rows = 1000 , columns = 4


Unnamed: 0,owner_org_title,owner_org,annual_budget,num_employees
0,Agriculture and Agri-Food Canada,atssc-scdata,60256967,75976
1,Accessibility Standards Canada,aafc-aac,9374190,19470


In [19]:
print("Dataset summary for orgdetail")
otravel.describe(include="all").round(2)

Dataset summary for orgdetail


Unnamed: 0,owner_org_title,owner_org,annual_budget,num_employees
count,1000,1000,1000.0,1000.0
unique,3,3,,
top,Agriculture and Agri-Food Canada,aafc-aac,,
freq,338,341,,
mean,,,52295148.77,50411.98
std,,,27981661.59,29136.16
min,,,5217735.0,414.0
25%,,,28366150.25,24291.75
50%,,,51745855.0,49656.5
75%,,,76997335.0,75831.25


In [20]:
print("The following information for travelq.csv")
print("rows =", dtravel.shape[0], ", columns =", dtravel.shape[1])
dtravel.head(2)

The following information for travelq.csv
rows = 1000 , columns = 15


Unnamed: 0,ref_number,title_en,name,purpose_en,start_date,end_date,destination_en,airfare,other_transport,lodging,meals,other_expenses,total,owner_org,owner_org_title
0,T-20120-P11-001,Chief Executive Officer,Philip Rizcallah,To attend meeting with Saskatchewan Provincial...,2020-02-03,2020-02-04,"Regina, Saskatchewan, Canada",646.17,117.26,157.78,197.25,0.0,1118.46,casdo-ocena,Accessibility Standards Canada
1,T-2020-P11-0001,Chair,"Bérubé, Paul-Claude",Board members meeting,2020-02-09,2020-02-13,"Vancouver, British Columbia, Canada",1104.27,189.72,841.31,461.84,,2597.14,casdo-ocena,Accessibility Standards Canada


In [21]:
print("Dataset summary for travelq")
dtravel.describe(include="all").round(2)

Dataset summary for travelq


Unnamed: 0,ref_number,title_en,name,purpose_en,start_date,end_date,destination_en,airfare,other_transport,lodging,meals,other_expenses,total,owner_org,owner_org_title
count,1000,1000,1000,1000,1000,1000,1000,706.0,849.0,818.0,869.0,404.0,931.0,1000,1000
unique,949,108,113,798,627,620,368,,,,,,,3,3
top,T-2024-P12-0001,Chairperson,"Rizcallah, Philip",Board meeting,2020-02-09,2017-07-21,"Ottawa, ON",,,,,,,atssc-scdata,Administrative Tribunals Support Service of Ca...
freq,2,80,71,36,10,10,53,,,,,,,467,467
mean,,,,,,,,772.39,151.72,488.52,255.31,3.08,1490.72,,
std,,,,,,,,552.96,104.86,369.42,153.96,7.15,1031.55,,
min,,,,,,,,0.0,0.0,0.0,0.0,0.0,3.6,,
25%,,,,,,,,427.14,70.0,208.25,145.95,0.0,720.58,,
50%,,,,,,,,728.7,134.37,405.18,232.4,0.0,1398.79,,
75%,,,,,,,,1061.83,224.44,701.58,352.5,0.0,2058.38,,


# Merge Dataset

We will be using a **LEFT JOIN** to merge the two datasets, so that every row of the first dataset (travelq) is preserved while matching values are looked up in the second (orgdetail). Specifically, we match on both owner_org_title and owner_org, the two columns the datasets share.

In [24]:
mmerge = dtravel.merge(otravel, on=["owner_org_title", "owner_org"], how="left")
mmerge

Unnamed: 0,ref_number,title_en,name,purpose_en,start_date,end_date,destination_en,airfare,other_transport,lodging,meals,other_expenses,total,owner_org,owner_org_title,annual_budget,num_employees
0,T-20120-P11-001,Chief Executive Officer,Philip Rizcallah,To attend meeting with Saskatchewan Provincial...,2020-02-03,2020-02-04,"Regina, Saskatchewan, Canada",646.17,117.26,157.78,197.25,0.0,1118.46,casdo-ocena,Accessibility Standards Canada,41864973,75203
1,T-20120-P11-001,Chief Executive Officer,Philip Rizcallah,To attend meeting with Saskatchewan Provincial...,2020-02-03,2020-02-04,"Regina, Saskatchewan, Canada",646.17,117.26,157.78,197.25,0.0,1118.46,casdo-ocena,Accessibility Standards Canada,27840441,86677
2,T-20120-P11-001,Chief Executive Officer,Philip Rizcallah,To attend meeting with Saskatchewan Provincial...,2020-02-03,2020-02-04,"Regina, Saskatchewan, Canada",646.17,117.26,157.78,197.25,0.0,1118.46,casdo-ocena,Accessibility Standards Canada,68075313,3063
3,T-20120-P11-001,Chief Executive Officer,Philip Rizcallah,To attend meeting with Saskatchewan Provincial...,2020-02-03,2020-02-04,"Regina, Saskatchewan, Canada",646.17,117.26,157.78,197.25,0.0,1118.46,casdo-ocena,Accessibility Standards Canada,51845886,25242
4,T-20120-P11-001,Chief Executive Officer,Philip Rizcallah,To attend meeting with Saskatchewan Provincial...,2020-02-03,2020-02-04,"Regina, Saskatchewan, Canada",646.17,117.26,157.78,197.25,0.0,1118.46,casdo-ocena,Accessibility Standards Canada,83131133,50998
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110699,T-2018-Q1-00020,Director of Communications,Guy Gallant,To support the Minister in the announcement of...,2018-05-23,2018-05-25,"Winnipeg, Manitoba, Toronto, Ontario, Charlott...",2162.73,0.00,353.83,273.65,0.0,2790.21,aafc-aac,Agriculture and Agri-Food Canada,27085861,93850
110700,T-2018-Q1-00020,Director of Communications,Guy Gallant,To support the Minister in the announcement of...,2018-05-23,2018-05-25,"Winnipeg, Manitoba, Toronto, Ontario, Charlott...",2162.73,0.00,353.83,273.65,0.0,2790.21,aafc-aac,Agriculture and Agri-Food Canada,13393710,27214
110701,T-2018-Q1-00020,Director of Communications,Guy Gallant,To support the Minister in the announcement of...,2018-05-23,2018-05-25,"Winnipeg, Manitoba, Toronto, Ontario, Charlott...",2162.73,0.00,353.83,273.65,0.0,2790.21,aafc-aac,Agriculture and Agri-Food Canada,52274346,70085
110702,T-2018-Q1-00020,Director of Communications,Guy Gallant,To support the Minister in the announcement of...,2018-05-23,2018-05-25,"Winnipeg, Manitoba, Toronto, Ontario, Charlott...",2162.73,0.00,353.83,273.65,0.0,2790.21,aafc-aac,Agriculture and Agri-Food Canada,15163549,63142
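The index of the merged output above runs to 110702, far more than the 1,000 rows in `dtravel`. That is consistent with the orgdetail summary, which shows 1,000 rows but only 3 unique `owner_org` values: a left join repeats each left row once for every matching right row. The sketch below reproduces the effect on hypothetical toy frames (`left`, `right`, and their values are made up) and shows how `validate="many_to_one"` would flag it. Dropping duplicate keys is only one possible remedy and assumes any single orgdetail row per organization is acceptable, which the source does not confirm.

```python
import pandas as pd

# Hypothetical frames: the right side repeats key "a".
left = pd.DataFrame({"owner_org": ["a", "a", "b"], "total": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"owner_org": ["a", "a", "b"], "annual_budget": [1, 2, 3]})

# Each left row is repeated once per matching right row: 2*2 + 1 = 5 rows.
exploded = left.merge(right, on="owner_org", how="left")
print(len(left), len(exploded))

# validate="many_to_one" raises MergeError because the right keys are not unique.
try:
    left.merge(right, on="owner_org", how="left", validate="many_to_one")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True

# One possible remedy (an assumption about intent): keep one right row per key.
right_unique = right.drop_duplicates(subset="owner_org")
safe = left.merge(right_unique, on="owner_org", how="left", validate="many_to_one")
print(len(safe))
```

Comparing `len()` before and after a left join is a quick sanity check: with unique right-hand keys the row count should not change.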


# Reflection

## Question 1
1. left (DataFrame | Series)
This is the first DataFrame or Series you are merging. It will be the "left" DataFrame in the merge.
2. right (DataFrame | Series)
This is the second DataFrame or Series you are merging. It will be the "right" DataFrame in the merge.
3. how ('MergeHow' = 'inner')
Defines how the merge is done:
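'inner': Keeps only rows whose key values appear in both DataFrames.
'outer': Keeps all rows from both DataFrames, filling non-matches with NaN.
'left': Keeps all rows from the left DataFrame, with NaN in the right-hand columns where there is no match.
'right': Keeps all rows from the right DataFrame, with NaN in the left-hand columns where there is no match.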
4. on (IndexLabel | AnyArrayLike | None = None)
Specifies which column(s) to merge on, assuming the column names are the same in both DataFrames.
Example: If both DataFrames have a column called id, you can merge on on='id'.
5. left_on (IndexLabel | AnyArrayLike | None = None)
Specifies the column(s) in the left DataFrame to merge on. Use this when the column names are different between the two DataFrames.
6. right_on (IndexLabel | AnyArrayLike | None = None)
Specifies the column(s) in the right DataFrame to merge on. Use this when the column names are different between the two DataFrames.
7. left_index (bool = False)
If True, uses the index of the left DataFrame as the key for merging. If False, it uses the columns.
8. right_index (bool = False)
If True, uses the index of the right DataFrame as the key for merging. If False, it uses the columns.
9. sort (bool = False)
If True, sorts the resulting DataFrame lexicographically by the merge keys. If False, the order of the rows depends on the join type (a left join, for example, preserves the key order of the left DataFrame).
10. suffixes (Suffixes = ('_x', '_y'))
Specifies suffixes to add to columns that have the same name in both DataFrames, to distinguish them.
Example: If both DataFrames have a column name, the resulting columns will be named name_x and name_y.
11. copy (bool | None = None)
If False, pandas avoids copying data where possible; otherwise the result is a new DataFrame (a copy).
12. indicator (str | bool = False)
If True, adds a special _merge column to the result to show where each row came from (left_only, right_only, or both).
Example: Useful for identifying rows that matched only in one DataFrame or in both.
13. validate (str | None = None)
If set, checks that the merge keys have the expected relationship and raises a MergeError if they do not:
'one_to_one': Merge keys must be unique in both the left and right DataFrames.
'one_to_many': Merge keys must be unique in the left DataFrame.
'many_to_one': Merge keys must be unique in the right DataFrame.
'many_to_many': Accepted, but performs no check.
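Several of these parameters are easier to follow in combination. The sketch below uses two hypothetical frames (`trips` and `orgs`, with invented columns and values) whose key columns have different names and which share a non-key column called `name`, so it exercises `left_on`, `right_on`, `suffixes`, and `indicator` together.

```python
import pandas as pd

# Hypothetical frames: key columns are named differently, and both have "name".
trips = pd.DataFrame({"org_code": ["a", "b", "c"], "name": ["t1", "t2", "t3"]})
orgs = pd.DataFrame({"owner_org": ["a", "b", "d"], "name": ["Org A", "Org B", "Org D"]})

result = trips.merge(
    orgs,
    how="outer",                  # keep unmatched rows from both sides
    left_on="org_code",           # key column in the left frame
    right_on="owner_org",         # key column in the right frame
    suffixes=("_trip", "_org"),   # disambiguates the shared "name" column
    indicator=True,               # adds the _merge column
)

# _merge records whether each row came from the left, the right, or both.
counts = result["_merge"].value_counts().to_dict()
print(counts)
```

Here "a" and "b" match on both sides, "c" exists only on the left, and "d" only on the right, which is what the `_merge` column reports.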

## Question 2
A compound key is a combination of two or more columns that together uniquely identify each row in a DataFrame, even though no single one of those columns does. In the example of "Iron Man" from the comics vs. the MCU, the character name alone is ambiguous, so you could create a compound key from "character_name" (e.g., "Iron Man") and "source" (e.g., "comic" or "MCU"). The merge in this notebook uses the same idea by matching on both owner_org_title and owner_org.
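A minimal sketch of the Iron Man example follows. The frames, column names, and values are illustrative assumptions; the point is only that `character_name` repeats while the pair (`character_name`, `source`) does not.

```python
import pandas as pd

# Illustrative data: "Iron Man" appears once per source.
appearances = pd.DataFrame({
    "character_name": ["Iron Man", "Iron Man", "Thor"],
    "source": ["comic", "MCU", "MCU"],
    "first_year": [1963, 2008, 2011],
})
actors = pd.DataFrame({
    "character_name": ["Iron Man", "Thor"],
    "source": ["MCU", "MCU"],
    "actor": ["Robert Downey Jr.", "Chris Hemsworth"],
})

key = ["character_name", "source"]

# The name alone is not unique, but the compound key is.
name_repeats = bool(appearances["character_name"].duplicated().any())
key_repeats = bool(appearances.duplicated(subset=key).any())

# Passing a list to `on` merges on the compound key.
merged = appearances.merge(actors, on=key, how="left")
print(name_repeats, key_repeats, len(merged))
```

Merging on `character_name` alone would attach the MCU actor to the comic-book row as well; the compound key keeps the two versions apart.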

## Question 3
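Yes, I believe there is a difference, although both match rows on key values. In pandas, a **join** (`DataFrame.join`) matches against the index of the other DataFrame and defaults to a left join, whereas a **merge** (`DataFrame.merge`) matches on columns (or on indexes, if requested) and defaults to an inner join. Neither one mixes the two DataFrames completely regardless of whether the rows match.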
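For pandas specifically, the behavior of the two methods can be compared on the same toy data. The frames below are hypothetical; the sketch relies only on the documented defaults of `DataFrame.merge` (`how="inner"`, keyed on columns) and `DataFrame.join` (`how="left"`, keyed on the other frame's index).

```python
import pandas as pd

# Hypothetical frames: "c" has no match on the right.
left = pd.DataFrame({"owner_org": ["a", "b", "c"], "total": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"owner_org": ["a", "b"], "num_employees": [5, 7]})

# merge: matches on columns, inner join by default, so "c" is dropped.
merged = left.merge(right, on="owner_org")

# join: matches the left column against the right index, left join by default,
# so "c" is kept with NaN for num_employees.
joined = left.join(right.set_index("owner_org"), on="owner_org")

print(len(merged), len(joined))  # 2 3
```

With matching `how` and key settings the two calls return the same rows, so the practical difference lies in their defaults and in what they key on.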

## Question 4
I believe an inner join should be the default because it keeps only rows whose key values match in both DataFrames, so the result contains only data that is present on both sides.

# Final File

In [31]:
date_str = date.today().strftime("%Y-%m-%d")
merged_file_name = f"merged_travel_data_{date_str}.csv"
mmerge.to_csv(merged_file_name, index=False)
print(f"Merged dataset saved as: {merged_file_name}")

Merged dataset saved as: merged_travel_data_2025-02-18.csv
