
# TASK 2 (Data analysis)

### In this task, use your data analytics skills to answer the question posed in Task 1. Depending on your chosen question, you will typically have to perform Exploratory Data Analysis (EDA), data pre-processing, statistics-based data analysis, and data visualisation, and use unsupervised machine learning algorithms (e.g., clustering).

### Business Analytical Question:  What are the key determinants of pricing for Airbnb listings in NYC? How do factors like location, property type, and reviews influence the price?



<a id = "table-of-content"></a>
# Table of Contents

- [Business Understanding](#business_undestanding)
- [Data Understanding](#data_understanding)
- [Data Preparation](#data_preparation)
- [Modelling and Evaluation](#modelling_n_evaluation)
- [Conclusion](#conclusion)

<a id = "business_undestanding"></a>
# Business Understanding
What are the key determinants of pricing for Airbnb listings in NYC? How do factors like location, property type, and reviews influence the price?

<a id = "data_understanding"></a>
# Data Understanding
In this section we explore the dataset to understand its structure and quality, and to look for patterns related to price.

We will examine whether the columns below are strongly associated with *price*:

* name - appears to be a free-text description of the listing
* neighbourhood_group
* neighbourhood
* latitude
* longitude
* room_type
* price (the target variable)
* minimum_nights
* number_of_reviews
* last_review
* reviews_per_month
* calculated_host_listings_count
* availability_365
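
As a rough sketch of how such a check could look, the snippet below ranks numeric columns by their Pearson correlation with `price` and compares median prices across a categorical column. The `toy` DataFrame and its values are hypothetical and made up purely for illustration; the same calls should apply to the real `df` once it is loaded. Pearson correlation only captures linear relationships, so a weak value does not rule out other kinds of association.

```python
import pandas as pd

# Hypothetical toy rows (values invented for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 300],
    "minimum_nights": [1, 2, 3, 2, 5],
    "number_of_reviews": [40, 30, 20, 10, 5],
    "availability_365": [365, 200, 100, 50, 10],
    "room_type": ["Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Entire home/apt"],
})

# Numeric columns: Pearson correlation with price, strongest (by magnitude) first
price_corr = (
    toy.select_dtypes(include="number")
    .corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(price_corr)

# Categorical columns: compare the median price per category instead
median_by_room = toy.groupby("room_type")["price"].median()
print(median_by_room)
```

Categorical columns such as `room_type` or `neighbourhood_group` have no Pearson correlation with price, which is why the sketch falls back to a grouped median for them.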


In [None]:
# Import libraries for data manipulation, visualisation, and missing-value inspection

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Set the aesthetics for the plots
sns.set(style="whitegrid")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the CSV file into a DataFrame (Kaggle path; the Google Drive alternative is commented out)

# csv_path = '/content/drive/MyDrive/TeamProject/AB_NYC_2019.csv'
csv_path = '/kaggle/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv'

# from google.colab import drive
# drive.mount('/content/drive')

df = pd.read_csv(csv_path)
df.describe()

In [None]:
#inspect data
df.head()

In [None]:
#number of rows and columns
df.shape

In [None]:
#column names
df.columns

In [None]:
#check data types
df.dtypes

## Missing values

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

In [None]:
# Count the null values in each column
total_rows = len(df)
null_counts = df.isnull().sum()

# Sort the columns in descending order of null value counts
sorted_null_counts = null_counts.sort_values(ascending=False)

# Select top 20 columns
top_20_null_counts = sorted_null_counts.head(20)

# Calculate percentage for these top 20 columns
top_20_null_percentage = (top_20_null_counts / total_rows) * 100

# Plot
fig, ax = plt.subplots(figsize=(14, 7))
top_20_null_percentage.plot(kind='bar', ax=ax)  # draw on the axes created above
ax.set_title('Percentage of Null Values in Columns')
ax.set_ylabel('Percentage of Null Values (%)')
ax.set_xlabel('Column Names')
ax.set_yticks(np.arange(0, 101, 10))  # ticks from 0 to 100 inclusive
# Add caption
fig.subplots_adjust(bottom=0.2)  # Adjust the bottom margin to make space for the caption
fig.text(0.5, -0.05, "Figure 1.0", ha='center', va='center', fontsize=10, wrap=True)
plt.show()

In [None]:
# Handling missing values
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated chained assignment in recent pandas versions
df['name'] = df['name'].fillna('Unknown')  # replace missing names with 'Unknown'
df['host_name'] = df['host_name'].fillna('Unknown')  # replace missing host names with 'Unknown'
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')  # Convert to datetime
df['last_review'] = df['last_review'].fillna(pd.Timestamp('1900-01-01'))  # Placeholder for no reviews
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)  # Replace missing values (no reviews) with 0


In [None]:
#check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

In [None]:
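# Visualise the missing-value pattern on a random sample of 250 rows
# Note: columns filled in the imputation cell above will no longer show gaps
msno.matrix(df.sample(250))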

In [None]:
# Nullity correlation heatmap: shows whether missing values in different columns occur together
msno.heatmap(df)

## Univariate Analysis

### Price
Some listings have a price of 0. A zero price is implausible for a listing, so these values are most likely data errors and may need to be imputed.
If imputation is required, I propose a nearest-neighbour approach such as KNN imputation, which fills each missing price from the most similar listings.
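
A minimal sketch of the KNN idea, assuming that zero prices are treated as missing and that similarity is measured on `latitude` and `longitude` only. The `knn_impute_price` helper and the toy rows are hypothetical and written purely for illustration; scikit-learn's `KNNImputer` is a ready-made alternative that follows the same principle.

```python
import numpy as np
import pandas as pd

def knn_impute_price(frame, feature_cols, k=2):
    """Replace non-positive prices with the mean price of the k nearest valid listings.

    Hypothetical helper: distance is plain Euclidean distance on feature_cols.
    """
    out = frame.copy()
    out["price"] = out["price"].astype(float)
    valid = out["price"] > 0
    donors = out.loc[valid, feature_cols].to_numpy(dtype=float)
    donor_prices = out.loc[valid, "price"].to_numpy()
    for idx in out.index[~valid]:
        point = out.loc[idx, feature_cols].to_numpy(dtype=float)
        dist = np.linalg.norm(donors - point, axis=1)  # distance to every valid listing
        nearest = np.argsort(dist)[:k]                 # positions of the k closest donors
        out.loc[idx, "price"] = donor_prices[nearest].mean()
    return out

# Hypothetical toy rows (values invented for illustration)
toy = pd.DataFrame({
    "latitude": [40.70, 40.71, 40.72, 40.80, 40.81],
    "longitude": [-73.95, -73.96, -73.95, -73.90, -73.91],
    "price": [100, 0, 120, 60, 70],
})

imputed = knn_impute_price(toy, ["latitude", "longitude"], k=2)
print(imputed)
```

In this toy example the zero-priced row sits closest to the listings priced 100 and 120, so it receives their mean. On the real data, features measured on different scales would need to be standardised before computing distances.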

In [None]:
# Number of listings with a non-positive price (.size would count rows x columns)
len(df[df['price'] <= 0])

In [None]:
df[df['price'] <= 0].sort_values(by='price', ascending=True)

In [None]:
# Distribution of prices
# Create a boxplot for the price distribution
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['price'])
plt.title('Boxplot of Price Distribution')
plt.xlabel('Price')
plt.show()

In [None]:
# Create a histogram for the price distribution
plt.figure(figsize=(10, 6))
plt.hist(df['price'], bins=100)
plt.title('Histogram of Price Distribution')
plt.xlabel('Price')
plt.show()

In [None]:
# Log-transform price to reduce skew; log1p (log(1 + x)) keeps zero prices finite,
# whereas np.log(0) would give -inf
df['normalised_price'] = np.log1p(df['price'])
# Create a histogram for the log-transformed price distribution
# plt.figure(figsize=(10, 6))
# plt.hist(df['normalised_price'], bins=10)
# plt.title('Histogram of Normalized Price Distribution')
# plt.xlabel('Normalised Price')
# plt.show()
df['normalised_price'].min()

### Room Types

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='room_type', y='price', data=df)
plt.title('Price Distribution Across Room Types')
plt.xlabel('Room Type')
plt.ylabel('Price')
plt.show()

In [None]:
# Extract numerical features for examination
numeric_df = df.select_dtypes(include=[np.number])
print(numeric_df.columns)

# Extract categorical features for examination
# (np.object was removed in NumPy 1.24; the string 'object' selects the same columns)
categorical_df = df.select_dtypes(include=['object'])
categorical_df.columns