# Airbnb Listings ETL
This notebook performs an **Extract, Transform, Load (ETL)** process on Airbnb listings data.

## 1. Import Libraries
We begin by importing necessary Python libraries for data manipulation and analysis.

In [38]:
import re
import pandas as pd
import numpy as np

## 2. Load the Data
We read the Airbnb listings dataset from a CSV file. The file is semicolon-delimited, so `sep=";"` must be passed explicitly; with the default comma separator, pandas would read each row as a single column.

In [39]:
air_bnb_data = pd.read_csv(r"C:\Users\isabe\PycharmProjects\GSB-520-ETL\Data\air-bnb-listings.csv",
                           sep = ";")
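
The absolute path above only works on one machine, and the effect of the separator is easy to miss. The following is a minimal sketch, not part of the original notebook: `DATA_DIR` is a hypothetical project-relative folder, and the in-memory sample stands in for the real file so the difference between the two `read_csv` calls can be seen without it.

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical project-relative location; adjust to wherever the CSV lives.
DATA_DIR = Path("Data")
csv_path = DATA_DIR / "air-bnb-listings.csv"

# Small in-memory stand-in for the real file, using the same ';' delimiter.
# The rows are illustrative sample values, not a copy of the dataset.
sample = io.StringIO(
    "Room ID;Name;City;Country\n"
    "9153958;La Casa sui Tetti;Florence;Italy\n"
    "575758;a REAL warehouse LOFT;Denver;United states\n"
)

with_sep = pd.read_csv(sample, sep=";")
sample.seek(0)
without_sep = pd.read_csv(sample)  # default sep="," puts each row in one column

print(with_sep.shape)     # (2, 4)
print(without_sep.shape)  # (2, 1)
```

With the real data, `pd.read_csv(csv_path, sep=";")` would replace the hard-coded path, assuming the notebook is run from the project root.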

## 3. Preview the Data
We display the first few rows to understand the dataset structure.

In [40]:
air_bnb_data.head()

Unnamed: 0,Room ID,Name,Host ID,Neighbourhood,Room type,Room Price,Minimum nights,Number of reviews,Date last review,Number of reviews per month,Rooms rent by the host,Availibility,Updated Date,City,Country,Coordinates,Location
0,9153958,.2 La Casa sui Tetti dell'Oltrarno,47655975,Centro Storico,Entire home/apt,50,1,363,2020-02-17,6.69,2,0,2020-06-19,Florence,Italy,"43.76678908298443, 11.24589003597409","Italy, Florence, Centro Storico"
1,9191572,Firenze Studio Porta Romana,47816872,Centro Storico,Entire home/apt,70,2,46,2018-03-15,0.83,1,364,2020-06-19,Florence,Italy,"43.76544742217545, 11.24254982481698","Italy, Florence, Centro Storico"
2,9306481,ELEGANT FLAT FLORENCE CITY CENTER,42112573,Centro Storico,Entire home/apt,120,2,132,2020-01-29,2.42,1,270,2020-06-19,Florence,Italy,"43.779967029995476, 11.261314413194068","Italy, Florence, Centro Storico"
3,9524225,Luxury Modern Studio - Florence,11065786,Campo di Marte,Entire home/apt,75,3,56,2019-12-30,1.01,5,364,2020-06-19,Florence,Italy,"43.76815640280208, 11.275230547227133","Italy, Florence, Campo di Marte"
4,9602552,Florence is a dream,956926,Campo di Marte,Entire home/apt,50,2,24,2018-10-27,0.44,1,365,2020-06-19,Florence,Italy,"43.78809311664643, 11.273347846180846","Italy, Florence, Campo di Marte"


## 4. Analyze Country Distribution
We count the number of listings per country.

In [41]:
air_bnb_data["Country"].value_counts()

Country
United states     240662
Italy             179607
Spain             109219
United kingdom    103964
France             90621
Australia          89580
China              88885
Greece             57646
Canada             56659
Ireland            35996
Germany            35985
Portugal           35965
Brazil             35731
Denmark            28523
Argentina          24134
South africa       24062
Turkey             23728
Mexico             21824
Netherlands        19450
Chile              15970
Japan              14715
Austria            12974
Belgium            12808
Czech republic     12565
Norway              8830
Taiwan              8103
Switzerland         7894
Sweden              7635
Singapore           7323
Belize              2960
Name: count, dtype: int64

## 5. Filter down to U.S. Data
We keep only the listings located in the United States. The dataset spells the country as `United states`, so the comparison must match that casing exactly.

In [42]:
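# .copy() makes us_airbnb an independent DataFrame, so later column
# assignments do not trigger pandas' SettingWithCopyWarning
us_airbnb = air_bnb_data[air_bnb_data["Country"] == "United states"].copy()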

In [43]:
us_airbnb.head()

Unnamed: 0,Room ID,Name,Host ID,Neighbourhood,Room type,Room Price,Minimum nights,Number of reviews,Date last review,Number of reviews per month,Rooms rent by the host,Availibility,Updated Date,City,Country,Coordinates,Location
2418,575758,a REAL warehouse LOFT,2832150,Five Points,Entire home/apt,200,180,38,2017-01-09,0.51,1,0,2020-06-28,Denver,United states,"39.767916331935616, -104.97976015587257","United states, Denver, Five Points"
2419,1041934,"Private apt in Berkeley, MTN Views!",5811115,Berkeley,Entire home/apt,215,30,27,2019-12-02,0.32,7,247,2020-06-28,Denver,United states,"39.78122240894239, -105.03904720467456","United states, Denver, Berkeley"
2420,1311993,Downtown Apt - Clean & Convenient! Brooks Towe...,6658113,CBD,Entire home/apt,90,30,7,2017-06-03,0.09,22,317,2020-06-28,Denver,United states,"39.74651081035565, -104.99583469460369","United states, Denver, CBD"
2421,1557739,Full bed with shared bathroom,8289288,Cole,Private room,60,30,125,2019-06-15,1.66,7,0,2020-06-28,Denver,United states,"39.76278786287324, -104.97123079202251","United states, Denver, Cole"
2422,3338717,Private basement apartment on park.,16853725,Rosedale,Entire home/apt,90,2,92,2020-06-13,1.26,1,306,2020-06-28,Denver,United states,"39.675036382043245, -104.97842716027984","United states, Denver, Rosedale"


## 6. Process Location Data
We extract latitude and longitude from the 'Coordinates' column and convert them to numerical values.

In [44]:
# Splitting the 'Coordinates' column into 'Latitude' and 'Longitude'
us_airbnb[['Latitude', 'Longitude']] = us_airbnb['Coordinates'].str.split(', ', expand=True)

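# Convert Latitude and Longitude to float for numerical operations.
# Plain column assignment is used because `.loc[:, col] = ...` writes into the
# existing column in pandas 2.x and would leave its dtype as object.
us_airbnb['Latitude'] = us_airbnb['Latitude'].astype(float)
us_airbnb['Longitude'] = us_airbnb['Longitude'].astype(float)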

# Dropping the original 'Coordinates' column
us_airbnb.drop(columns=['Coordinates'], inplace=True)


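After the conversion it is worth confirming that both columns are really numeric and fall in plausible ranges. The sketch below is an illustration on a two-row sample in the same `"lat, lon"` format; the frame `df` and the intermediate `coords` are names introduced here, not taken from the notebook.

```python
import pandas as pd

# Sample rows in the same "lat, lon" string format as the 'Coordinates' column.
df = pd.DataFrame({
    "City": ["Denver", "Denver"],
    "Coordinates": [
        "39.767916331935616, -104.97976015587257",
        "39.78122240894239, -105.03904720467456",
    ],
})

# Split once, convert both parts to float, and name the resulting columns.
coords = df["Coordinates"].str.split(", ", expand=True).astype(float)
coords.columns = ["Latitude", "Longitude"]

# Replace the original string column with the two numeric columns.
df = pd.concat([df.drop(columns=["Coordinates"]), coords], axis=1)

# Sanity checks: numeric dtype and values inside valid coordinate ranges.
lat_ok = df["Latitude"].between(-90, 90).all()
lon_ok = df["Longitude"].between(-180, 180).all()
print(df.dtypes)
print(lat_ok, lon_ok)
```
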
## 7. Standardize City Names
We clean and standardize city names by replacing hyphens with spaces, capitalizing abbreviations, and removing unnecessary state abbreviations.

In [45]:
# Clean the city names: replace hyphens with spaces and convert to title case
us_airbnb['City'] = us_airbnb['City'].str.replace("-", " ").str.title()

# Define a mapping for abbreviations that should be fully capitalized.
abbrev_map = {
    'Dc': 'DC',
    'Nv': 'NV',
    'Msa': 'MSA',
    'Or': 'OR'
}

def capitalize_abbrev(city):
    # Split the city name into words
    words = city.split()
    # Replace any word found in our abbreviation map with its uppercase version
    words = [abbrev_map.get(word, word) for word in words]
    # Rejoin the words into a string
    return " ".join(words)

# Apply the function to the cleaned city names
us_airbnb['City'] = us_airbnb['City'].apply(capitalize_abbrev)

# Adjust specific city names (e.g., "New York City" to "New York")
def adjust_city_name(city):
    if city == "New York City":
        return "New York"
    return city

us_airbnb['City'] = us_airbnb['City'].apply(adjust_city_name)

def drop_state(city):
    """
    Remove a trailing uppercase suffix (whitespace followed by 2+ uppercase
    letters, e.g. a state abbreviation or "MSA") from the city name.
    "Washington DC" is a special case and is rewritten as "Washington, D.C.".
    """
    # Special case: normalize "Washington DC" (case-insensitive comparison)
    if city.strip().lower() == "washington dc":
        return "Washington, D.C."
    # Otherwise, remove the trailing uppercase suffix
    return re.sub(r'\s+[A-Z]{2,}$', '', city)

# Apply the function and overwrite the 'City' column with the result
us_airbnb["City"] = us_airbnb["City"].apply(drop_state)


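To see how the cleaning steps in this section interact, it can help to run them on a few strings. The sketch below folds the steps into one hypothetical helper, `standardize_city`, and applies it to made-up raw values; the actual raw spellings in the dataset are an assumption and may differ.

```python
import re

ABBREVIATIONS = {"Dc": "DC", "Nv": "NV", "Msa": "MSA", "Or": "OR"}

def standardize_city(raw):
    """Apply the section's cleaning steps to a single raw city string."""
    # Hyphens to spaces, then title case
    city = raw.replace("-", " ").title()
    # Restore abbreviations that title case turned into "Dc", "Nv", ...
    city = " ".join(ABBREVIATIONS.get(word, word) for word in city.split())
    # Special cases handled before the generic suffix removal
    if city == "New York City":
        return "New York"
    if city == "Washington DC":
        return "Washington, D.C."
    # Drop a trailing run of 2+ uppercase letters (state code or "MSA")
    return re.sub(r"\s+[A-Z]{2,}$", "", city)

# Hypothetical raw values, chosen to exercise each rule.
samples = ["new-york-city", "washington-dc", "clark-county-nv",
           "twin-cities-msa", "portland"]
cleaned = {raw: standardize_city(raw) for raw in samples}
for raw, city in cleaned.items():
    print(f"{raw} -> {city}")
```

The order matters: the abbreviations must be uppercased before the regular expression runs, because the pattern only matches uppercase suffixes.
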
## 8. Check for Missing Values
We check for any missing values in the dataset.

In [46]:
us_airbnb.isna().sum().sort_values(ascending=False)

Number of reviews per month    49937
Date last review               49937
Name                              32
Room ID                            0
Host ID                            0
Neighbourhood                      0
Room Price                         0
Room type                          0
Number of reviews                  0
Minimum nights                     0
Rooms rent by the host             0
Availibility                       0
Updated Date                       0
City                               0
Country                            0
Location                           0
Latitude                           0
Longitude                          0
dtype: int64
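
The two review columns have exactly the same number of missing values, which suggests (but does not prove) that the gaps belong to listings with no reviews. A minimal sketch of how that assumption could be checked, shown on a made-up four-row sample rather than the real data:

```python
import numpy as np
import pandas as pd

# Illustrative sample: two reviewed listings and two without reviews.
df = pd.DataFrame({
    "Number of reviews": [38, 0, 7, 0],
    "Date last review": ["2017-01-09", np.nan, "2017-06-03", np.nan],
    "Number of reviews per month": [0.51, np.nan, 0.09, np.nan],
})

no_reviews = df["Number of reviews"] == 0
missing_date = df["Date last review"].isna()
missing_rate = df["Number of reviews per month"].isna()

# True only when the gaps line up row by row with zero-review listings.
gaps_explained = bool(
    (no_reviews == missing_date).all() and (no_reviews == missing_rate).all()
)
print(gaps_explained)
```

If the same check holds on `us_airbnb`, filling the monthly review rate with 0 is a faithful description of those listings rather than a guess.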

## 9. Handle Missing Data
We fill or adjust missing values for specific columns.

In [47]:
# Ensure we're working on a copy to avoid warnings
final_airbnb_df = us_airbnb.copy()

# Fill missing values for 'Number of reviews per month'
final_airbnb_df['Number of reviews per month'] = final_airbnb_df['Number of reviews per month'].fillna(0)

# Convert 'Date last review' to datetime; missing values become NaT
# automatically, so no separate fill step is needed
final_airbnb_df['Date last review'] = pd.to_datetime(final_airbnb_df['Date last review'])

# Fill missing 'Name' values with a placeholder
final_airbnb_df['Name'] = final_airbnb_df['Name'].fillna("Unnamed Listing")

### Append State column
Add a state column using a hand-built city-to-state lookup, which makes it easier to join this data with the other datasets.

In [48]:
# Define a dictionary mapping city-like names to states
city_to_state = {
    "New York": "NY",
    "Los Angeles": "CA",
    "Hawaii": "HI",  # Entire state
    "San Diego": "CA",
    "Broward County": "FL",
    "Clark County": "NV",
    "Austin": "TX",
    "Washington, D.C.": "DC",
    "Chicago": "IL",
    "San Francisco": "CA",
    "Santa Clara County": "CA",
    "New Orleans": "LA",
    "Nashville": "TN",
    "Seattle": "WA",
    "Twin Cities": "MN",  # Minneapolis & St. Paul (MN)
    "Portland": "OR",  # Assuming Portland, OR, not ME
    "Denver": "CO",
    "Rhode Island": "RI",  # Entire state
    "Boston": "MA",
    "Oakland": "CA",
    "San Mateo County": "CA",
    "Jersey City": "NJ",
    "Asheville": "NC",
    "Santa Cruz County": "CA",
    "Columbus": "OH",
    "Cambridge": "MA",
    "Salem": "OR",  # Assuming Salem, OR, not MA
    "Pacific Grove": "CA"
}

# Function to apply the mapping
def map_city_to_state(city):
    return city_to_state.get(city, "Unknown")  # Default to 'Unknown' if not found

# Derive the 'State' column from the standardized city names
final_airbnb_df["State"] = final_airbnb_df["City"].apply(map_city_to_state)

In [50]:
final_airbnb_df["State"].value_counts()

State
CA    70918
NY    48588
HI    22917
FL    10858
NV    10568
TX    10321
DC     8708
IL     7628
LA     6918
TN     6848
WA     6575
MN     6470
OR     4488
MA     4469
CO     4200
RI     3884
NJ     2488
NC     2407
OH     1409
Name: count, dtype: int64
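
The counts above contain no `Unknown` entry, so every city in this run was covered by the dictionary. If the source data changes, a quick check like the following sketch can list any city that falls through; the shortened mapping and the unmapped city `Springfield` are made up for illustration.

```python
import pandas as pd

# Truncated, illustrative mapping; the notebook's full dictionary is longer.
city_to_state = {"Denver": "CO", "Austin": "TX"}

# 'Springfield' is a made-up city that the mapping does not cover.
df = pd.DataFrame({"City": ["Denver", "Austin", "Springfield", "Denver"]})

# Series.map with a dict yields NaN for missing keys; label those explicitly.
df["State"] = df["City"].map(city_to_state).fillna("Unknown")

# Collect the distinct cities that still need a dictionary entry.
unmapped = sorted(df.loc[df["State"] == "Unknown", "City"].unique())
print(unmapped)
```

An empty list means the mapping is complete for the current data.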

## 10. Save the Cleaned Dataset
Finally, we save the cleaned dataset for further analysis.

In [49]:
# Save Cleaned Dataset
final_airbnb_df.to_csv(
    r"C:\Users\isabe\PycharmProjects\GSB-520-ETL\Clean Data\cleaned_airbnb_data.csv", index=False)
