<a href="https://www.kaggle.com/code/ayushgpt8/merging-data?scriptVersionId=156077514" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Merging Data

Combining and reshaping data from multiple sources

In this notebook, we will explore various techniques for combining and reshaping data from multiple sources. We will use two datasets from Kaggle: the New York City Airbnb open data at https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data and the New York City Airbnb temperature data at https://www.kaggle.com/datasets/ayushgpt8/new-york-city-airbnb-temperature/data

Table of contents:
* Combining data using pandas library
* Validating Merges
* Debugging Chains
* Exporting to Excel

In [1]:
import numpy as np
import pandas as pd

On Kaggle, the datasets can be attached directly from the side panel and read from `/kaggle/input`. If you are running on Kaggle, you can continue as is. If you are running elsewhere, comment out the two cells below that read from `/kaggle/input` and uncomment the cell that reads the data from the `data/` directory.

In [2]:
# Assumes you are running on Kaggle and have attached the dataset as input from the side panel
airbnb = pd.read_csv('/kaggle/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv', dtype_backend='pyarrow', engine='pyarrow')
airbnb

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


In [3]:
temps = pd.read_csv(
    "/kaggle/input/new-york-city-airbnb-temperature/nyc-ab-temp.csv",
    index_col=0,
    dtype_backend="pyarrow",
    engine="pyarrow",
)

temps

Unnamed: 0,lat,lon,temp
0,40.64749,-73.97237,72.0
1,40.75362,-73.98377,57.0
2,40.80902,-73.94190,76.0
3,40.68514,-73.95976,24.0
4,40.79851,-73.94399,27.0
...,...,...,...
48890,40.67853,-73.94995,76.0
48891,40.70184,-73.93317,71.0
48892,40.81475,-73.94867,44.0


In [4]:
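# Assumes you have downloaded the datasets and stored them in the data/ directory.
# import zipfile

# List the archive contents (avoid naming the handle `zip`, which shadows the builtin).
# with zipfile.ZipFile("data/AB_NYC_2019.csv.zip") as archive:
#     print(archive.namelist())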

# airbnb = pd.read_csv("data/AB_NYC_2019.csv.zip", dtype_backend="pyarrow", engine="pyarrow")
# temps = pd.read_csv(
#     "data/nyc-ab-temp.csv", index_col=0, dtype_backend="pyarrow", engine="pyarrow"
# )
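Instead of commenting cells in and out, one option is to choose the data directory at runtime. The sketch below assumes the two locations mentioned above (`/kaggle/input` and a local `data/` directory); `first_existing_dir` is a hypothetical helper written for this illustration, not part of pandas or Kaggle.

```python
from pathlib import Path


def first_existing_dir(candidates):
    """Return the first candidate that exists as a directory, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


# Assumed locations: the Kaggle input mount first, then a local data/ folder.
data_dir = first_existing_dir(["/kaggle/input", "data"])
print(data_dir)
```

The file names under each location still differ (for example, the local copy of the Airbnb data is a `.zip`), so the paths passed to `pd.read_csv` would need to be built accordingly.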

## Merging

Let's see how merging works in pandas.

In [5]:
# simplest call, but it raises a MergeError because no column names are shared
# (airbnb
#     .merge(temps)
# )

This fails because, by default, pandas merges on the columns that have the same name in both dataframes. The error message tells us that it could not find any common columns to merge on. Just for fun, let's explore a little.

In [6]:
(airbnb
    .columns
    .intersection(temps.columns)
)

Index([], dtype='object')

As expected, there are no common columns between the two dataframes.
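The same failure can be reproduced on a small scale. The sketch below uses two toy dataframes with made-up values and non-overlapping column names (`left` and `right` are illustrative names, not part of the datasets) and captures the exception that `merge` raises when no `on` columns are given.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy frames (made-up values) whose column names do not overlap.
left = pd.DataFrame({"latitude": [40.1, 40.2], "price": [100, 150]})
right = pd.DataFrame({"lat": [40.1, 40.2], "temp": [70.0, 65.0]})

# With no shared names, the column intersection is empty...
common = left.columns.intersection(right.columns)
print(list(common))

# ...so a bare merge has nothing to join on and raises MergeError.
try:
    left.merge(right)
    error_message = None
except MergeError as exc:
    error_message = str(exc)

print(error_message)
```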

In [7]:
# Here are airbnb columns
airbnb.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

In [8]:
# Here are temps columns
temps.columns

Index(['lat', 'lon', 'temp'], dtype='object')

We will need to explicitly tell pandas the columns to merge on.

In [9]:
(airbnb
    .merge(temps, 
           left_on=['latitude', 'longitude'], 
           right_on=['lat', 'lon']
          )
)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,lat,lon,temp
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,40.64749,-73.97237,72.0
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,40.75362,-73.98377,57.0
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365,40.80902,-73.94190,76.0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,40.68514,-73.95976,24.0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0,40.79851,-73.94399,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9,40.67853,-73.94995,76.0
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36,40.70184,-73.93317,71.0
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27,40.81475,-73.94867,44.0
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2,40.75751,-73.99112,73.0


The `left_on` parameter refers to columns of the airbnb dataframe (the one `merge` is called on), and `right_on` refers to columns of the temps dataframe (the one passed in as the argument).
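As the output above shows, both sets of key columns (`latitude`/`longitude` and `lat`/`lon`) end up in the result. The following sketch, using toy dataframes with made-up values (`listings` and `readings` are illustrative names), shows the same pattern and one possible way to drop the redundant right-hand keys afterwards.

```python
import pandas as pd

# Toy stand-ins for the airbnb and temps dataframes (values are made up).
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "latitude": [40.5, 40.6, 40.7],
    "longitude": [-73.5, -73.6, -73.7],
})
readings = pd.DataFrame({
    "lat": [40.5, 40.6],
    "lon": [-73.5, -73.6],
    "temp": [70.0, 65.0],
})

# left_on names columns in `listings`; right_on names columns in `readings`.
# The keys are paired by position: latitude with lat, longitude with lon.
merged = listings.merge(
    readings,
    left_on=["latitude", "longitude"],
    right_on=["lat", "lon"],
)

# The default how="inner" keeps only rows whose keys appear in both frames,
# so the listing with id 3 (no matching reading) is dropped.
# Both pairs of key columns survive the merge; drop the right-hand copies.
tidy = merged.drop(columns=["lat", "lon"])
print(tidy)
```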