# Notebook 01: Data Loading and Overview

## Purpose
This notebook is the initial exploration phase of our data science project. The goals are to:
- Load the NYC Airbnb dataset
- Understand the structure and composition of the data
- Identify data types and initial quality issues
- Generate summary statistics

## Learning Objectives
- Demonstrate a systematic approach to exploring unknown datasets
- Identify numerical and categorical features
- Detect missing values and potential data quality issues

---
## 1. Import Required Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

---
## 2. Load the Dataset

We load the dataset from the `data/` directory, which sits one level above this notebook (hence the `../data/` path). This dataset contains information about Airbnb listings in New York City.

In [None]:
# Load the dataset
df = pd.read_csv('../data/MinoAI_dataset.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Number of observations: {df.shape[0]:,}")
print(f"Number of variables: {df.shape[1]}")
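
If the relative path is wrong, `pd.read_csv` fails with a fairly terse error. The sketch below wraps the load in a small helper that checks the path first. The helper name `load_listings` and the demo file are hypothetical and only illustrate the idea; they are not part of the project.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_listings(csv_path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path.resolve()}")
    return pd.read_csv(path)


# Demo on a tiny throwaway file (made-up rows, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("id,price\n1,100\n2,250\n")
    demo_df = load_listings(demo_path)

print(demo_df.shape)
```

In this notebook the call would presumably be `load_listings('../data/MinoAI_dataset.csv')`, assuming the notebook is run from its own directory.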

---
## 3. Initial Data Inspection

### 3.1 First Few Rows
Examining the first few rows helps us understand the structure and content of the dataset.

In [None]:
# Display first 5 rows
df.head()

### 3.2 Last Few Rows
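Checking the last rows helps confirm that the file was read completely and that the final records follow the same structure as the first ones (no truncated rows or trailing footer lines).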

In [None]:
# Display last 5 rows
df.tail()

### 3.3 Random Sample
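A random sample is less affected by file ordering than the first or last rows, so it gives a more representative picture of the data. Fixing `random_state` keeps the sample reproducible.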

In [None]:
# Display 10 random rows
df.sample(10, random_state=42)

---
## 4. Dataset Information

### 4.1 Column Names and Data Types

In [None]:
# Display dataset information
print("Dataset Information:")
print("="*80)
df.info()

### 4.2 Column List

In [None]:
# List all columns
print("Column Names:")
print("="*80)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

---
## 5. Data Type Classification

Separating numerical from categorical features matters because the two groups need different summary statistics, plots, and cleaning steps later on.

In [None]:
# Identify numerical columns ('number' matches every integer and float dtype, not only the 64-bit ones)
numerical_cols = df.select_dtypes(include='number').columns.tolist()

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print("NUMERICAL COLUMNS:")
print("="*80)
for i, col in enumerate(numerical_cols, 1):
    print(f"{i:2d}. {col}")

print("\nCATEGORICAL COLUMNS:")
print("="*80)
for i, col in enumerate(categorical_cols, 1):
    print(f"{i:2d}. {col}")

print(f"\nTotal Numerical: {len(numerical_cols)}")
print(f"Total Categorical: {len(categorical_cols)}")
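
A dtype-based split treats identifier columns such as `id` and `host_id` as numerical, although their means and quantiles carry no meaning. The sketch below shows one way to set them aside before classifying. The helper name `split_feature_types` and the demo rows are assumptions for illustration, and everything that is not numeric is lumped into the categorical list, which would also include date or boolean columns if present.

```python
import pandas as pd


def split_feature_types(frame, id_cols=("id", "host_id")):
    """Return (numerical, categorical) column lists, skipping identifier columns."""
    present_ids = [c for c in id_cols if c in frame.columns]
    features = frame.drop(columns=present_ids)
    # 'number' covers all integer and float dtypes
    numerical = features.select_dtypes(include="number").columns.tolist()
    # Everything else is treated as categorical in this simplified sketch
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Made-up rows using column names from this dataset
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "host_id": [10, 11, 10],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [80, 200, 65],
    "reviews_per_month": [0.5, None, 1.2],
})

num_cols, cat_cols = split_feature_types(demo)
print(num_cols)
print(cat_cols)
```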

---
## 6. Summary Statistics

### 6.1 Numerical Features

In [None]:
# Statistical summary of numerical features
df.describe()
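
`describe()` shows the minimum, maximum, and quartiles, which is often enough to suspect outliers but not to count them. As a rough follow-up, the sketch below counts values outside the usual 1.5 × IQR fences. The helper name `iqr_outlier_count` and the demo prices are hypothetical, and the 1.5 multiplier is a common convention rather than something this project prescribes.

```python
import pandas as pd


def iqr_outlier_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((values < lower) | (values > upper)).sum())


# Made-up prices with one obviously extreme value
demo_price = pd.Series([50, 60, 70, 80, 90, 100, 5000])
print(iqr_outlier_count(demo_price))
```

On the real data this could be applied column by column, for example `iqr_outlier_count(df['price'])`.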

### 6.2 Categorical Features

In [None]:
# Statistical summary of categorical features
df.describe(include=['object'])

---
## 7. Missing Values Analysis

Identifying missing values is critical for data cleaning in the next notebook.

In [None]:
# Count missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

# Create a summary dataframe
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': missing_values.values,
    'Missing_Percentage': missing_percentage.values
})

# Sort by missing count (descending)
missing_summary = missing_summary.sort_values('Missing_Count', ascending=False)

# Display only columns with missing values
print("MISSING VALUES SUMMARY:")
print("="*80)
missing_summary[missing_summary['Missing_Count'] > 0]
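
Counts alone do not say why values are missing. For review-related columns, a plausible explanation is that the listing simply has no reviews. The sketch below checks that idea on made-up rows by comparing `number_of_reviews == 0` against missing `last_review` values; on the real data the same comparison would run on `df`, and the result there is not known in advance.

```python
import pandas as pd

# Made-up rows using column names from this dataset
demo = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12],
    "last_review": [None, "2019-05-01", None, "2019-06-15"],
    "reviews_per_month": [None, 0.4, None, 1.1],
})

# Cross-tabulate "has no reviews" against "review date is missing"
no_reviews = demo["number_of_reviews"] == 0
missing_date = demo["last_review"].isna()
overlap = pd.crosstab(no_reviews, missing_date,
                      rownames=["no_reviews"], colnames=["last_review_missing"])
print(overlap)

# True only when the two conditions coincide on every row
explained = bool((no_reviews == missing_date).all())
print(explained)
```

If `explained` comes out `True` on the full dataset, the missing values are structural rather than random, which affects how they should be filled during cleaning.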

### 7.1 Visualize Missing Values

In [None]:
# Visualize missing values
# Filter columns with missing values
cols_with_missing = missing_summary[missing_summary['Missing_Count'] > 0]

if len(cols_with_missing) > 0:
    # Create the figure only when there is something to plot
    plt.figure(figsize=(12, 6))
    plt.barh(cols_with_missing['Column'], cols_with_missing['Missing_Percentage'])
    plt.xlabel('Missing Percentage (%)', fontsize=12)
    plt.ylabel('Column Name', fontsize=12)
    plt.title('Missing Values by Column', fontsize=14, fontweight='bold')
    plt.grid(axis='x', alpha=0.3)
    
    # Add percentage labels
    for i, (col, pct) in enumerate(zip(cols_with_missing['Column'], cols_with_missing['Missing_Percentage'])):
        plt.text(pct + 0.5, i, f'{pct:.2f}%', va='center')
    
    plt.tight_layout()
    plt.show()
else:
    print("No missing values found in the dataset!")

---
## 8. Unique Values Analysis

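The number of unique values per column (its cardinality) helps distinguish identifiers, low-cardinality categorical features, and free-text fields, which in turn guides feature engineering.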

In [None]:
# Analyze unique values for each column
unique_summary = pd.DataFrame({
    'Column': df.columns,
    'Unique_Count': [df[col].nunique() for col in df.columns],
    'Data_Type': df.dtypes.values
})

print("UNIQUE VALUES SUMMARY:")
print("="*80)
unique_summary
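
Raw unique counts are easier to interpret relative to the number of rows. The sketch below adds a unique ratio and two simple flags: columns where every value is distinct (likely identifiers) and columns with a single value (constants). The helper name `cardinality_report`, the flag names, and the `city` demo column are hypothetical.

```python
import pandas as pd


def cardinality_report(frame):
    """Summarise unique counts and the share of distinct values per column."""
    n_rows = len(frame)
    report = pd.DataFrame({
        "Unique_Count": frame.nunique(),
        "Unique_Ratio": frame.nunique() / n_rows,
    })
    # Every value distinct suggests an identifier; one value means a constant column
    report["Looks_Like_ID"] = report["Unique_Count"] == n_rows
    report["Is_Constant"] = report["Unique_Count"] <= 1
    return report


# Made-up rows; "city" is an illustrative constant column, not a real field
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "room_type": ["Private room", "Shared room", "Private room", "Private room"],
    "city": ["NYC", "NYC", "NYC", "NYC"],
})
print(cardinality_report(demo))
```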

### 8.1 Categorical Feature Distribution

In [None]:
# Display unique values for key categorical columns
key_categorical = ['neighbourhood_group', 'room_type']

for col in key_categorical:
    if col in df.columns:
        print(f"\n{col.upper()}:")
        print("="*80)
        print(df[col].value_counts())
        print(f"\nUnique values: {df[col].nunique()}")
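
Absolute counts are easier to compare across columns when shown next to their shares. A minimal sketch, using made-up `room_type` values, combines `value_counts()` with its `normalize=True` variant:

```python
import pandas as pd

# Made-up values for illustration
demo = pd.Series(
    ["Private room", "Entire home/apt", "Private room", "Shared room"],
    name="room_type",
)

# Both Series share the category labels as index, so they align row by row
distribution = pd.DataFrame({
    "Count": demo.value_counts(),
    "Percent": (demo.value_counts(normalize=True) * 100).round(2),
})
print(distribution)
```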

---
## 9. Initial Observations and Insights

### Key Findings:

Based on our initial exploration, we can make the following observations:

1. **Dataset Size**: The dataset contains approximately 49,000 Airbnb listings across New York City.

2. **Features**: We have 16 variables including:
   - Listing identifiers (id, name)
   - Host information (host_id, host_name)
   - Location data (neighbourhood_group, neighbourhood, latitude, longitude)
   - Property characteristics (room_type, price, minimum_nights)
   - Review metrics (number_of_reviews, last_review, reviews_per_month)
   - Availability (availability_365)

3. **Missing Values**: Several columns contain missing values, particularly:
   - `last_review` and `reviews_per_month` (likely for listings with no reviews)
   - `name` and `host_name` (some missing text data)

4. **Data Types**: 
   - Numerical features: price, minimum_nights, number_of_reviews, etc.
   - Categorical features: neighbourhood_group, room_type, etc.
   - Text features: name, host_name

5. **Potential Issues**:
   - Date column (`last_review`) is stored as object/string, needs conversion
   - Missing values need to be handled appropriately
   - Potential outliers in price and other numerical features
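
The date issue noted above can be previewed with `pd.to_datetime`. The sketch below runs on made-up values and assumes the dates are stored as `YYYY-MM-DD` strings, which should be confirmed against the actual column before relying on it.

```python
import pandas as pd

# Made-up values: two valid dates, one missing entry, one malformed entry
demo = pd.DataFrame({"last_review": ["2019-05-21", "2018-11-03", None, "not a date"]})

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(demo["last_review"], format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

Comparing the number of `NaT` values after conversion with the original missing count is a quick way to spot malformed dates that the coercion silently dropped.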

### Next Steps:
In the next notebook (02_data_cleaning.ipynb), we will:
- Handle missing values using multiple techniques (forward fill, backward fill, interpolation, mean imputation)
- Convert data types where necessary
- Check for and remove duplicates
- Prepare the dataset for exploratory data analysis

---
## 10. Summary Information

We print the key figures gathered above for reference in subsequent notebooks. This cell only displays them; nothing is written to disk.

In [None]:
# Print the dataset overview
print("Dataset Overview Summary:")
print("="*80)
print(f"Total Rows: {df.shape[0]:,}")
print(f"Total Columns: {df.shape[1]}")
print(f"Numerical Columns: {len(numerical_cols)}")
print(f"Categorical Columns: {len(categorical_cols)}")
print(f"Columns with Missing Values: {(missing_values > 0).sum()}")
print(f"Total Missing Values: {missing_values.sum():,}")
print(f"Missing Data Percentage: {(missing_values.sum() / (df.shape[0] * df.shape[1]) * 100):.2f}%")
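
To make these figures available to later notebooks without re-running this one, they could be written to a small JSON file. The sketch below is one possible approach on made-up data. The helper name `build_overview`, the key names, and the output file name are hypothetical, and a temporary directory stands in for whatever project path would actually be used.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def build_overview(frame):
    """Collect headline figures as plain Python ints so they serialise to JSON."""
    missing = frame.isna().sum()
    return {
        "total_rows": int(frame.shape[0]),
        "total_columns": int(frame.shape[1]),
        "columns_with_missing": int((missing > 0).sum()),
        "total_missing": int(missing.sum()),
    }


# Made-up rows for illustration
demo = pd.DataFrame({
    "price": [100, 250, None],
    "room_type": ["Private room", None, None],
})
overview = build_overview(demo)

# Write to a temporary directory here; a real notebook would pick a project path
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "overview.json"
    out_path.write_text(json.dumps(overview, indent=2))
    restored = json.loads(out_path.read_text())

print(restored)
```

Casting to `int` matters because NumPy integer types returned by pandas are not JSON serialisable by the standard `json` module.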

---
## Conclusion

This notebook has successfully:
- ✅ Loaded the NYC Airbnb dataset
- ✅ Examined the structure and composition of the data
- ✅ Identified numerical and categorical features
- ✅ Detected missing values and their distribution
- ✅ Generated summary statistics
- ✅ Documented initial observations

The dataset is now ready for the data cleaning phase in the next notebook.

---
**Next Notebook**: [02_data_cleaning.ipynb](02_data_cleaning.ipynb)