# INSTRUCTIONS

**Perform a basic exploratory data analysis of the Inside Airbnb dataset and answer some questions**

To complete the assignment, paste the answers where appropriate, and upload the full notebook at the end.


**Submission format**: One Jupyter notebook (`Airbnb_GroupX.ipynb`).

Please do *not* submit:
* A zip file
* A link to Google CoLab
* A file with the wrong extension
* A Python script


To complete the assignment, follow the steps. The data is attached below.


The grading criteria are, in decreasing order of importance and increasing order of subjectivity, as follows:


* Code has no errors, the whole notebook runs from top to bottom without modifications (35 %)
* Code gives correct answers (30 %)
* Code avoids repetition and favours pandas methods where appropriate (loops and conditionals only if strictly necessary) (15 %)
* Code uses meaningful, explanatory variable names (10 %)
* Code is as succinct as possible (when there are two ways of doing something, the simplest, shortest, or easiest to understand is chosen) (5 %)
  * If you discuss several ways of doing something, with their pros and cons (rather than just dumping the code with no explanation), that counts positively as well
* Code is easy to read (i.e. "similar to how the professor codes") (5 %)


Optionally, you can include code comments describing the intent (i.e. code comments should answer "why is this code here?", not "what is this code doing?") and supplementary markdown cells if appropriate.



All the questions can be done independently of each other (after reading the data). They are not sorted in any particular order of difficulty.

# IMPORT LIBRARIES AND DATASET


In [None]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('listings.csv')


In [None]:
# Question 1: How many rows does the dataset have? (Excluding the header)
num_rows = len(df)
print(f"Number of rows: {num_rows}")


In [None]:
# Question 2: How many columns does the dataset have? (Excluding the autogenerated numerical index)
num_columns = len(df.columns)
print(f"Number of columns: {num_columns}")


# PART 1: BASIC EXPLORATORY ANALYSIS

In [None]:
# Question 3: How many unique values are there for host_id?
unique_hosts = df['host_id'].nunique()
print(f"Number of unique host_id values: {unique_hosts}")


**Question 1**

How many rows does the dataset have? (Excluding the header containing the column names)

In [None]:
# Question 4: Count how many listings there are per host. Find the host with the largest number of listings.
listings_per_host = df['host_id'].value_counts()
max_listings = listings_per_host.max()
host_with_max_listings = listings_per_host.idxmax()

print(f"Host with the largest number of listings (host_id: {host_with_max_listings}) has {max_listings} listings")


**Question 2**

How many columns does the dataset have? (Excluding the autogenerated numerical index)

In [None]:
# Question 5: How many distinct hosts are superhosts?
superhosts = df[df['host_is_superhost'] == 't']['host_id'].nunique()
print(f"Number of distinct superhosts: {superhosts}")


**Question 3**

How many unique values are there for host_id?

In [None]:
# Question 6: Find the district with the largest number of listings
listings_per_district = df['neighbourhood_group_cleansed'].value_counts()
max_listings_district = listings_per_district.max()
district_with_max = listings_per_district.idxmax()

print(f"District with the largest number of listings: {district_with_max} with {max_listings_district} listings")


**Question 4**

Count how many listings there are per host (where 1 row = 1 listing). Find the host with the largest number of listings. How many listings do they have?

In [None]:
# Question 7: What's the average price of listings?
# First, convert price from string to numeric (remove $ and commas)
# '$' is a regex metacharacter, so strip both symbols with one explicit character class
df['price_numeric'] = df['price'].str.replace(r'[$,]', '', regex=True).astype(float)
average_price = df['price_numeric'].mean()
print(f"Average price: ${average_price:.2f}")


**Question 5**

How many distinct hosts are superhosts?

In [None]:
# Question 8: How many listings have zero reviews?
zero_reviews = (df['number_of_reviews'] == 0).sum()
print(f"Number of listings with zero reviews: {zero_reviews}")


**Question 6**

In the city of Madrid there are 2 administrative levels represented in the dataset: neighbourhood and district. Find the district with the largest number of listings. How many does it have?

In [None]:
# Question 9: Listings that are instantly bookable have an average number of reviews per month X % higher than those that are not
instant_bookable_avg = df[df['instant_bookable'] == 't']['reviews_per_month'].mean()
not_instant_bookable_avg = df[df['instant_bookable'] == 'f']['reviews_per_month'].mean()
percentage_difference = ((instant_bookable_avg - not_instant_bookable_avg) / not_instant_bookable_avg) * 100

print(f"Average reviews per month (instant bookable): {instant_bookable_avg:.2f}")
print(f"Average reviews per month (not instant bookable): {not_instant_bookable_avg:.2f}")
print(f"Percentage difference: {percentage_difference:.1f}%")


**Question 7**

What's the average price of listings? (An error of ±1 USD is accepted)

In [None]:
# Question 10: How many listings have missing (null) license information?
missing_license = df['license'].isna().sum()
print(f"Number of listings with missing license information: {missing_license}")


**Question 8**

How many listings have zero reviews?

In [None]:
# Question 11: How many listings have a license containing the string "ESFC"?
esfc_license_count = df['license'].str.contains('ESFC', na=False).sum()
print(f"Number of listings with license containing 'ESFC': {esfc_license_count}")


**Question 9**

Fill the gap: "Listings that are instantly bookable have an average number of reviews per month X % higher than those that are not" (An error of ±1 % is accepted)
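
One possible way to obtain X with a single `groupby` instead of two separate filters is sketched below. The toy frame and the helper name `pct_higher` are made up for illustration, and the sketch assumes the flag is stored as the strings `'t'` / `'f'`.

```python
import pandas as pd


def pct_higher(listings: pd.DataFrame) -> float:
    """Percentage by which the mean reviews_per_month of instantly bookable
    listings exceeds that of the rest (flags assumed to be 't' / 'f')."""
    mean_by_flag = listings.groupby("instant_bookable")["reviews_per_month"].mean()
    return (mean_by_flag["t"] / mean_by_flag["f"] - 1) * 100


# Illustrative values only, not taken from the dataset
toy_listings = pd.DataFrame({
    "instant_bookable": ["t", "t", "f", "f", "f"],
    "reviews_per_month": [3.0, 1.0, 1.0, None, 2.0],
})

toy_result = pct_higher(toy_listings)
print(f"{toy_result:.1f} %")
```

Missing `reviews_per_month` values are skipped by `mean()`, so listings without reviews do not drag either average down.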

In [None]:
# Question 12: How many listings have declared "Exempt" or "En proceso" in the license field?
exempt_or_proceso = df['license'].str.contains('Exempt|En proceso', case=False, na=False).sum()
print(f"Number of listings with 'Exempt' or 'En proceso' in license: {exempt_or_proceso}")


**Question 10**

How many listings have missing (null) license information?

In [None]:
# Question 13: How many hosts cannot be contacted by email?
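# Check host_verifications column - if it doesn't contain 'email', they can't be contacted by email
# Note: this is a substring match, so any verification whose name contains 'email' also counts;
# hosts with missing host_verifications are treated as not contactable
hosts_without_email = df[~df['host_verifications'].str.contains('email', na=False)]['host_id'].nunique()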
print(f"Number of hosts that cannot be contacted by email: {hosts_without_email}")


**Question 11**

Some licenses have a very long string starting with ES, followed by 2 letters (type of listing), followed by 2 letters (category), followed by a long list of numbers and some extra characters.

How many listings have a license containing the string "ESFC"?
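
The structure described above (ES, then 2 letters for the type, then 2 letters for the category) can be pulled apart with `str.extract`. This is only a sketch: the license strings are invented to mirror that description, and the group names `listing_type` and `category` are hypothetical labels.

```python
import pandas as pd

# Invented license strings that only mimic the structure described above
toy_licenses = pd.Series([
    "ESFCTU0000280390001234560000000000000000VT-12345",
    "ESHFNT00002803900099887700000000000000000000001",
    "Exempt",
    None,
])

# ES + 2 letters (type of listing) + 2 letters (category); non-matching rows become NaN
license_parts = toy_licenses.str.extract(
    r"^ES(?P<listing_type>[A-Z]{2})(?P<category>[A-Z]{2})"
)

# na=False makes missing licenses count as "does not contain"
contains_esfc = toy_licenses.str.contains("ESFC", na=False)

print(license_parts)
print(f"Licenses containing 'ESFC': {contains_esfc.sum()}")
```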

In [None]:
# Question 14: What's the maximum number of amenities found in any listing?
# Amenities are stored as a JSON-like string array; count items by counting commas and adding 1
# (this assumes that no amenity name itself contains a comma)
df['amenities_count'] = df['amenities'].str.count(',') + 1
# Empty arrays ("[]") have no commas but hold 0 items; missing values (NaN) are set to 0 as well
df.loc[df['amenities'].isna() | (df['amenities'].str.strip() == '[]'), 'amenities_count'] = 0
max_amenities = df['amenities_count'].max()
print(f"Maximum number of amenities in any listing: {max_amenities}")


**Question 12**

How many listings have declared "Exempt" or "En proceso" in the license field?

In [None]:
# Question 15: Which year has the record for number of hosts registered?
# Extract year from host_since column
df['host_since_year'] = pd.to_datetime(df['host_since']).dt.year
hosts_per_year = df.groupby('host_since_year')['host_id'].nunique()
year_with_max_hosts = hosts_per_year.idxmax()
max_hosts_count = hosts_per_year.max()

print(f"Year with the record for number of hosts registered: {year_with_max_hosts} ({max_hosts_count} hosts)")


**Question 13**

How many hosts *cannot* be contacted by email?
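
A substring search for `email` also matches longer names that merely contain it. A stricter alternative is to parse each value and test exact membership, as sketched below. It assumes `host_verifications` holds Python-style list literals such as `"['email', 'phone']"`; the toy data and the helper `has_email` are hypothetical.

```python
import ast

import pandas as pd

# Toy data; the list-literal format is an assumption about the column
toy_hosts = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 4],
    "host_verifications": [
        "['email', 'phone']",
        "['email', 'phone']",
        "['phone', 'work_email']",
        "[]",
        None,
    ],
})


def has_email(raw_verifications) -> bool:
    """True only if 'email' is an element of the parsed list (missing -> False)."""
    if pd.isna(raw_verifications):
        return False
    return "email" in ast.literal_eval(raw_verifications)


email_verified = toy_hosts["host_verifications"].map(has_email).astype(bool)
hosts_without_email = toy_hosts.loc[~email_verified, "host_id"].nunique()
print(f"Hosts that cannot be contacted by email: {hosts_without_email}")
```

The trade-off is an explicit helper with a conditional in exchange for exact matching; whether a work email should count as a contact channel is a judgement call worth stating in the answer.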

In [None]:
# Question 16: Examine the "license" field and identify different structures, count how many listings have each type
# First, let's explore the license field to understand its structure
license_types = {}

# Check for the ES code pattern (ES + 2 letters + 2 letters); the digits that follow are not validated
esfc_pattern = df['license'].str.contains(r'^ES[A-Z]{2}[A-Z]{2}', na=False, regex=True)
license_types['ES code (ES + 2 letters + 2 letters)'] = esfc_pattern.sum()

# Check for "Exempt"
exempt = df['license'].str.contains('Exempt', case=False, na=False)
license_types['Exempt'] = exempt.sum()

# Check for "En proceso"
en_proceso = df['license'].str.contains('En proceso', case=False, na=False)
license_types['En proceso'] = en_proceso.sum()

# Check for missing/null values
license_types['Missing/Null'] = df['license'].isna().sum()

# Check for other patterns - let's see what else exists
non_standard = df[~esfc_pattern & ~exempt & ~en_proceso & df['license'].notna()]['license'].unique()
if len(non_standard) > 0:
    print("Other license patterns found:")
    for license_val in non_standard[:10]:  # Show first 10
        count = (df['license'] == license_val).sum()
        license_types[f'Other: {license_val[:50]}'] = count

print("License type counts:")
for license_type, count in license_types.items():
    print(f"  {license_type}: {count}")


**Question 14**

What's the maximum number of amenities found in any listing?
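
If the `amenities` column holds valid JSON arrays (an assumption worth checking on the real file), parsing them is more robust than counting commas, because an amenity name may itself contain a comma. A minimal sketch on made-up values:

```python
import json

import pandas as pd

# Made-up values; the third amenity of the first listing contains a comma
toy_amenities = pd.Series([
    '["Wifi", "Kitchen", "Shampoo, conditioner"]',
    '[]',
    None,
])

# Treat missing values as empty arrays, parse each string, then count the items
amenities_count = toy_amenities.fillna("[]").map(json.loads).map(len)
max_amenities = amenities_count.max()
print(f"Maximum number of amenities: {max_amenities}")
```

Counting commas and adding 1 would report 4 for the first listing, whereas parsing reports 3.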

In [None]:
# Question 17: Explore host_location to find different countries, handling typos and special values
# Extract country from host_location (format is often "City, Country" or just "City")
# First, let's see the structure
df['host_location_clean'] = df['host_location'].fillna('')

# Try to extract country - usually after the last comma
df['extracted_country'] = df['host_location_clean'].str.split(',').str[-1].str.strip()

# Count countries
country_counts = df['extracted_country'].value_counts()

# Show top countries (excluding empty strings and common variations)
print("Top countries (excluding Spain):")
top_countries = country_counts[country_counts.index != 'Spain'].head(20)
for country, count in top_countries.items():
    if country:  # Skip empty strings
        print(f"  {country}: {count}")

# Check for typos/variations - look for common patterns
print(f"\nTotal unique country values (including variations): {df['extracted_country'].nunique()}")
print(f"\nTotal unique host_location values: {df['host_location'].nunique()}")


**Question 15**

Which year has the record for number of hosts registered?

In [None]:
# Question 18: Extract price outliers and inspect those listings
# Use IQR method to identify outliers
Q1 = df['price_numeric'].quantile(0.25)
Q3 = df['price_numeric'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[df['price_numeric'] > upper_bound].copy()

print(f"Number of price outliers (above ${upper_bound:.2f}): {len(outliers)}")
print(f"\nOutlier price statistics:")
print(f"  Min price: ${outliers['price_numeric'].min():.2f}")
print(f"  Max price: ${outliers['price_numeric'].max():.2f}")
print(f"  Mean price: ${outliers['price_numeric'].mean():.2f}")

# Most reviewed outliers
most_reviewed_outliers = outliers.nlargest(5, 'number_of_reviews')[['id', 'name', 'price_numeric', 'number_of_reviews', 'amenities_count', 'host_is_superhost', 'review_scores_rating']]
print(f"\nTop 5 most reviewed outliers:")
print(most_reviewed_outliers.to_string())

# Compare amenities
avg_amenities_outliers = outliers['amenities_count'].mean()
avg_amenities_all = df['amenities_count'].mean()
print(f"\nAverage amenities:")
print(f"  Outliers: {avg_amenities_outliers:.2f}")
print(f"  All listings: {avg_amenities_all:.2f}")
print(f"  Difference: {avg_amenities_outliers - avg_amenities_all:.2f}")

# Other interesting insights
print(f"\nOther insights about outliers:")
print(f"  Superhost percentage: {(outliers['host_is_superhost'] == 't').sum() / len(outliers) * 100:.1f}%")
print(f"  Average review score: {outliers['review_scores_rating'].mean():.2f}")
print(f"  Average number of reviews: {outliers['number_of_reviews'].mean():.1f}")


# PART 2: OPEN ENDED ANALYSIS

**Question 16**

Examine the "license" field a bit more closely. There are different structures present, apart from the one described above. Try to identify them and count how many listings have each type of license.
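
One way to make the license categories mutually exclusive, so that every listing is counted exactly once, is `numpy.select`, which assigns the label of the first matching condition. The sketch below uses invented license values and hypothetical category labels; the real field may need more patterns.

```python
import numpy as np
import pandas as pd

# Invented values covering each structure discussed in this notebook
toy_licenses = pd.Series([
    "ESFCTU000028039000123456",  # made-up registration-style code
    "Exempt",
    "En proceso",
    "VT-1234",                   # made-up value with a different structure
    None,
])

# Conditions are checked in order; the first match wins
conditions = [
    toy_licenses.isna(),
    toy_licenses.str.contains(r"^ES[A-Z]{4}", na=False),
    toy_licenses.str.contains("exempt", case=False, na=False),
    toy_licenses.str.contains("en proceso", case=False, na=False),
]
labels = ["missing", "ES registration code", "exempt", "en proceso"]

license_kind = pd.Series(
    np.select(conditions, labels, default="other"),
    index=toy_licenses.index,
)
license_kind_counts = license_kind.value_counts()
print(license_kind_counts)
```

Inspecting the rows labelled `other` is then a natural next step for spotting structures that still lack a pattern.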

**Question 17**

The "host_location" information is somewhat structured. Sometimes it contains (city, country), sometimes it doesn't. Explore how many different countries are present, trying to pay attention to typos, special values (like state names), and which are the most prevalent ones, other than Spain.
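
A possible starting point for cleaning the country values is to take the text after the last comma, normalise case and whitespace, and then apply a small hand-maintained mapping for special values. The locations and the `country_fixes` mapping below are illustrative assumptions, and the mapping would need to grow as the real data is inspected.

```python
import pandas as pd

# Illustrative locations, including a case variation and a US state abbreviation
toy_locations = pd.Series([
    "Madrid, Spain",
    "madrid, spain ",
    "Paris, France",
    "Miami, FL",
    "Spain",
    None,
])

# Hypothetical corrections for special values found by inspection
country_fixes = {"fl": "united states", "usa": "united states"}

# Text after the last comma, lower-cased and stripped so that variations collapse
last_token = toy_locations.str.split(",").str[-1].str.strip().str.lower()
country = last_token.replace(country_fixes)

# value_counts drops missing values by default
countries_other_than_spain = country[country != "spain"].value_counts()
print(countries_other_than_spain)
```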

**Question 18**

A few listings seem to be extremely expensive. Devise a method of extracting price outliers, and inspect those listings. Which ones have been most reviewed? Do they have more amenities than average? Highlight anything else that's interesting about them.