# NYC Noise Complaints and Liquor License Analysis Pipeline

This notebook implements a data processing and machine learning pipeline for analyzing the relationship between noise complaints and liquor-licensed establishments in NYC. Until the 311 complaint data is integrated, the models are trained on synthetic targets for demonstration.

## Project Overview
- **Goal**: Analyze noise complaints associated with areas that have liquor licenses
- **Data Sources**: 
  - NYC 311 Noise Complaints (311_noise.csv)
  - New York State Liquor Authority (SLA) Active Licenses (sla_active.csv)

## Pipeline Stages
1. Data Loading and Exploration
2. Data Preprocessing
3. Spatial Analysis and Feature Engineering
4. Model Training and Evaluation
5. Results and Insights

## 1. Setup and Imports

In [None]:
# Standard library imports
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print('Environment setup complete!')

## 2. Data Loading

Load the SLA liquor license data and, if the file is present, the 311 noise complaints data.

In [None]:
# Define data paths
DATA_DIR = Path('../data')
RAW_DIR = DATA_DIR / 'raw'
INTERIM_DIR = DATA_DIR / 'interim'
PROCESSED_DIR = DATA_DIR / 'processed'

# Create directories if they don't exist
INTERIM_DIR.mkdir(parents=True, exist_ok=True)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# Load SLA liquor license data
sla_path = RAW_DIR / 'sla_active.csv'
noise_path = RAW_DIR / '311_noise.csv'

print(f'Loading liquor license data from: {sla_path}')
df_sla = pd.read_csv(sla_path)
print(f'Loaded {len(df_sla)} liquor license records')
print(f'\nSLA Data Shape: {df_sla.shape}')
print(f'\nSLA Data Columns: {df_sla.columns.tolist()}')

# Check for noise complaints data
if noise_path.exists():
    print(f'\nLoading noise complaints data from: {noise_path}')
    df_noise = pd.read_csv(noise_path)
    print(f'Loaded {len(df_noise)} noise complaint records')
    print(f'Noise Data Shape: {df_noise.shape}')
else:
    print(f'\nNote: 311 noise complaints data not found at {noise_path}')
    print('To complete the analysis, please download NYC 311 data and place it in data/raw/')
    print('The pipeline will continue with liquor license data only.')
    df_noise = None

## 3. Data Exploration and Understanding

In [None]:
# Display basic information about SLA data
print('=== SLA Liquor License Data Overview ===')
print(f'\nFirst few records:')
display(df_sla.head())

print(f'\nData Types:')
print(df_sla.dtypes)

print(f'\nMissing Values:')
missing = df_sla.isnull().sum()
print(missing[missing > 0])

print(f'\nBasic Statistics:')
display(df_sla.describe())

In [None]:
# Analyze license types and descriptions
print('=== License Type Distribution ===')
print(df_sla['Description'].value_counts())

# Plot distribution of license types
plt.figure(figsize=(12, 6))
df_sla['Description'].value_counts().plot(kind='bar')
plt.title('Distribution of Liquor License Types')
plt.xlabel('License Type')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## 4. Data Preprocessing

Clean and prepare the data for analysis and modeling.

In [None]:
# Create a copy for processing
df_processed = df_sla.copy()

# Extract coordinates from Georeference column
def extract_coordinates(georeference):
    """Extract latitude and longitude from POINT string"""
    try:
        coords = georeference.replace('POINT (', '').replace(')', '')
        lon, lat = coords.split()
        return float(lat), float(lon)
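    except (AttributeError, ValueError):
        # AttributeError: value is not a string (e.g. NaN)
        # ValueError: string does not hold exactly two numeric parts
        return None, None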

df_processed[['Latitude', 'Longitude']] = df_processed['Georeference'].apply(
    lambda x: pd.Series(extract_coordinates(x))
)

print(f'Coordinates extracted. Records with valid coordinates: {df_processed["Latitude"].notna().sum()}')

# Convert date columns to datetime
date_columns = ['Original Issue Date', 'Last Issue Date', 'Effective Date', 'Expiration Date']
for col in date_columns:
    df_processed[col] = pd.to_datetime(df_processed[col], errors='coerce')

# Calculate license age and days until expiration
df_processed['License_Age_Days'] = (pd.Timestamp.now() - df_processed['Original Issue Date']).dt.days
df_processed['Days_Until_Expiration'] = (df_processed['Expiration Date'] - pd.Timestamp.now()).dt.days

# Fill missing DBA (doing-business-as name) with the legal name
df_processed['DBA'] = df_processed['DBA'].fillna(df_processed['LegalName'])

# Create binary feature for whether license is close to expiration
df_processed['Expiring_Soon'] = (df_processed['Days_Until_Expiration'] < 90).astype(int)

print('\nData preprocessing complete!')
print(f'Processed data shape: {df_processed.shape}')
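
The row-by-row coordinate parsing above could also be done in a vectorized way. The following is a minimal sketch, assuming the `Georeference` values look like `POINT (<longitude> <latitude>)` as the function above implies; `POINT_PATTERN`, `parse_points`, and the sample values are hypothetical names and data used only for illustration.

```python
import pandas as pd

# Pattern for strings like 'POINT (-73.98 40.75)': longitude first, then latitude
POINT_PATTERN = r'POINT \((?P<Longitude>-?\d+(?:\.\d+)?) (?P<Latitude>-?\d+(?:\.\d+)?)\)'

def parse_points(georeference: pd.Series) -> pd.DataFrame:
    """Return float Latitude/Longitude columns; NaN where parsing fails."""
    coords = georeference.str.extract(POINT_PATTERN)
    return coords[['Latitude', 'Longitude']].astype(float)

# Hypothetical sample values for illustration only
sample = pd.Series(['POINT (-73.98 40.75)', None, 'not a point'])
parsed = parse_points(sample)
print(parsed)
```

Rows that do not match the pattern come back as NaN, which matches how the later `dropna(subset=['Latitude', 'Longitude'])` step treats unparsed coordinates.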

In [None]:
# Save interim processed data
interim_path = INTERIM_DIR / 'sla_processed.csv'
df_processed.to_csv(interim_path, index=False)
print(f'Interim data saved to: {interim_path}')

## 5. Feature Engineering

Create features for machine learning models.

In [None]:
# Select features for modeling
feature_columns = [
    'Type', 'Class', 'Description', 'Premises County',
    'Latitude', 'Longitude', 'License_Age_Days', 
    'Days_Until_Expiration', 'Zip Code'
]

# Create feature dataframe
df_features = df_processed[feature_columns].copy()

# Drop rows with missing critical values
df_features = df_features.dropna(subset=['Latitude', 'Longitude'])

print(f'Feature engineering complete!')
print(f'Features shape: {df_features.shape}')
print(f'\nFeatures: {df_features.columns.tolist()}')
print(f'\nFeature data types:\n{df_features.dtypes}')

## 6. Noise Complaint Integration (When Available)

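This section is where noise complaint data will be integrated once it is available. For demonstration, the cell below creates synthetic target variables; it does so even when `df_noise` was loaded, because the spatial join is not implemented yet.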
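
Once real 311 data is available, the synthetic count could be replaced by the number of complaints within a fixed radius of each license. The following is a minimal sketch of one way to do that with scikit-learn's `BallTree` and the haversine metric. The 200 m radius, the function name `count_within_radius`, and the coordinates are assumptions for illustration, not values taken from this project.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def count_within_radius(license_latlon, complaint_latlon, radius_m=200.0):
    """Count complaints within radius_m meters of each license location.

    Both inputs are arrays of shape (n, 2) holding [latitude, longitude] in degrees.
    """
    # The haversine metric expects [lat, lon] in radians
    tree = BallTree(np.radians(complaint_latlon), metric='haversine')
    return tree.query_radius(
        np.radians(license_latlon), r=radius_m / EARTH_RADIUS_M, count_only=True
    )

# Hypothetical coordinates for illustration only
licenses = np.array([[40.7128, -74.0060], [40.7600, -74.0060]])
complaints = np.array([
    [40.7128, -74.0060],  # same spot as the first license
    [40.7138, -74.0060],  # roughly 110 m north of the first license
    [40.7600, -74.0060],  # same spot as the second license
])
counts = count_within_radius(licenses, complaints, radius_m=200.0)
print(counts)
```

A tree-based radius query avoids computing every license-to-complaint distance, which matters when the complaint table is large.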

In [None]:
# For demonstration purposes, create synthetic target variables
# In a real scenario, these would come from spatial join with noise complaint data

# Target 1: Number of noise complaints within radius (regression)
# Simulated: restaurants get a fixed bonus of 3 complaints (an arbitrary assumption, not an observed effect)
np.random.seed(RANDOM_STATE)
base_complaints = np.random.poisson(lam=5, size=len(df_features))
restaurant_multiplier = (df_features['Description'] == 'Restaurant').astype(int) * 3
df_features['Noise_Complaints_Count'] = base_complaints + restaurant_multiplier + np.random.poisson(lam=2, size=len(df_features))

# Target 2: High noise area classification (binary)
df_features['High_Noise_Area'] = (df_features['Noise_Complaints_Count'] > df_features['Noise_Complaints_Count'].median()).astype(int)

print('Target variables created:')
print(f'- Noise_Complaints_Count (regression target): min={df_features["Noise_Complaints_Count"].min()}, max={df_features["Noise_Complaints_Count"].max()}, mean={df_features["Noise_Complaints_Count"].mean():.2f}')
print(f'- High_Noise_Area (classification target): {df_features["High_Noise_Area"].value_counts().to_dict()}')

# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df_features['Noise_Complaints_Count'], bins=20, edgecolor='black')
axes[0].set_title('Distribution of Noise Complaints Count')
axes[0].set_xlabel('Number of Complaints')
axes[0].set_ylabel('Frequency')

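# Sort by class label so the bars are in 0, 1 order and match the 'Low', 'High' tick labels below
df_features['High_Noise_Area'].value_counts().sort_index().plot(kind='bar', ax=axes[1])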
axes[1].set_title('High vs Low Noise Areas')
axes[1].set_xlabel('High Noise Area (1=Yes, 0=No)')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(['Low', 'High'], rotation=0)

plt.tight_layout()
plt.show()

## 7. Machine Learning Pipeline - Classification

Build a pipeline to predict high noise areas based on liquor license characteristics.

In [None]:
# Prepare data for classification
X = df_features.drop(columns=['Noise_Complaints_Count', 'High_Noise_Area'])
y_classification = df_features['High_Noise_Area']

# Identify numeric and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print(f'Numeric features: {numeric_features}')
print(f'Categorical features: {categorical_features}')

# Create preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create full pipeline with classifier
clf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, max_depth=10))
])

print('\nClassification pipeline created!')

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_classification, test_size=0.2, random_state=RANDOM_STATE, stratify=y_classification
)

print(f'Training set size: {len(X_train)}')
print(f'Test set size: {len(X_test)}')
print(f'\nClass distribution in training set:\n{y_train.value_counts()}')

# Train the model
print('\nTraining classification model...')
clf_pipeline.fit(X_train, y_train)
print('Training complete!')

# Make predictions
y_pred_train = clf_pipeline.predict(X_train)
y_pred_test = clf_pipeline.predict(X_test)

# Evaluate
train_score = clf_pipeline.score(X_train, y_train)
test_score = clf_pipeline.score(X_test, y_test)

print(f'\n=== Classification Results ===')
print(f'Training Accuracy: {train_score:.4f}')
print(f'Test Accuracy: {test_score:.4f}')
print(f'\nTest Set Classification Report:')
print(classification_report(y_test, y_pred_test, target_names=['Low Noise', 'High Noise']))
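
A single 80/20 split yields one accuracy estimate. `cross_val_score` is imported in Section 1 but not used, and it could give a sense of how much that estimate varies across splits. The following is a minimal sketch on toy data; `X_toy`, `y_toy`, and `toy_pipeline` are hypothetical stand-ins for `X`, `y_classification`, and `clf_pipeline`, and the 5-fold setting is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
# Hypothetical toy data standing in for X and y_classification
X_toy = pd.DataFrame({
    'License_Age_Days': rng.integers(0, 5000, size=n),
    'Description': rng.choice(['Restaurant', 'Bar', 'Grocery'], size=n),
})
y_toy = rng.integers(0, 2, size=n)

toy_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['License_Age_Days']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Description']),
    ])),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(toy_pipeline, X_toy, y_toy, cv=cv, scoring='accuracy')
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Because the preprocessing lives inside the pipeline, it is refit on the training portion of each fold, so no information from the held-out fold leaks into the scaler or encoder.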

In [None]:
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Low Noise', 'High Noise'],
            yticklabels=['Low Noise', 'High Noise'])
plt.title('Confusion Matrix - High Noise Area Classification')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

In [None]:
# Feature importance analysis
try:
    # Get feature names after preprocessing
    feature_names = (numeric_features + 
                    clf_pipeline.named_steps['preprocessor']
                    .named_transformers_['cat']
                    .named_steps['onehot']
                    .get_feature_names_out(categorical_features).tolist())
    
    # Get feature importances
    importances = clf_pipeline.named_steps['classifier'].feature_importances_
    
    # Create dataframe and sort
    feature_importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importances
    }).sort_values('importance', ascending=False).head(15)
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.barh(range(len(feature_importance_df)), feature_importance_df['importance'])
    plt.yticks(range(len(feature_importance_df)), feature_importance_df['feature'])
    plt.xlabel('Importance')
    plt.title('Top 15 Feature Importances for Noise Prediction')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
    print('\nTop 10 Most Important Features:')
    print(feature_importance_df.head(10).to_string(index=False))
except Exception as e:
    print(f'Feature importance analysis skipped: {e}')

## 8. Machine Learning Pipeline - Regression

Build a pipeline to predict the number of noise complaints.

In [None]:
# Prepare data for regression
y_regression = df_features['Noise_Complaints_Count']

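# Create regression pipeline
# Clone the preprocessor: Pipeline does not copy its steps, so reusing the same
# object would refit the preprocessor already fitted inside clf_pipeline
from sklearn.base import clone

reg_pipeline = Pipeline(steps=[
    ('preprocessor', clone(preprocessor)),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE, max_depth=10))
])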

# Split data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_regression, test_size=0.2, random_state=RANDOM_STATE
)

print('Training regression model...')
reg_pipeline.fit(X_train_reg, y_train_reg)
print('Training complete!')

# Make predictions
y_pred_train_reg = reg_pipeline.predict(X_train_reg)
y_pred_test_reg = reg_pipeline.predict(X_test_reg)

# Evaluate
train_r2 = r2_score(y_train_reg, y_pred_train_reg)
test_r2 = r2_score(y_test_reg, y_pred_test_reg)
train_rmse = np.sqrt(mean_squared_error(y_train_reg, y_pred_train_reg))
test_rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_test_reg))

print(f'\n=== Regression Results ===')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
print(f'Training RMSE: {train_rmse:.4f}')
print(f'Test RMSE: {test_rmse:.4f}')

In [None]:
# Plot predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test_reg, y_pred_test_reg, alpha=0.6, edgecolors='k')
plt.plot([y_test_reg.min(), y_test_reg.max()], 
         [y_test_reg.min(), y_test_reg.max()], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Noise Complaints')
plt.ylabel('Predicted Noise Complaints')
plt.title(f'Regression: Predicted vs Actual (R² = {test_r2:.4f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 9. Save Models and Processed Data

In [None]:
import joblib

# Create models directory
MODELS_DIR = Path('../models')
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Save models
clf_model_path = MODELS_DIR / 'noise_classifier.pkl'
reg_model_path = MODELS_DIR / 'noise_regressor.pkl'

joblib.dump(clf_pipeline, clf_model_path)
joblib.dump(reg_pipeline, reg_model_path)

print(f'Classification model saved to: {clf_model_path}')
print(f'Regression model saved to: {reg_model_path}')

# Save processed features
processed_data_path = PROCESSED_DIR / 'features_with_targets.csv'
df_features.to_csv(processed_data_path, index=False)
print(f'\nProcessed features saved to: {processed_data_path}')
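
A saved pipeline can be restored later with `joblib.load`. The following is a minimal round-trip sketch, assuming a small stand-in model and a temporary directory; `X_demo`, `y_demo`, and `demo_classifier.pkl` are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
# Hypothetical toy data and model standing in for the trained pipelines
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'demo_classifier.pkl'
    joblib.dump(model, model_path)      # write the fitted model to disk
    restored = joblib.load(model_path)  # read it back

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(f'Predictions match after reload: {same_predictions}')
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, so it may be worth recording the scikit-learn version alongside the `.pkl` files.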

## 10. Summary and Next Steps

### What We've Accomplished:
1. ✅ Loaded and explored NYC SLA liquor license data
2. ✅ Performed data preprocessing and feature engineering
3. ✅ Created synthetic noise complaint targets for demonstration
4. ✅ Built and evaluated classification pipeline (High/Low noise areas)
5. ✅ Built and evaluated regression pipeline (Noise complaint count)
6. ✅ Saved trained models and processed data

### Next Steps:
1. **Integrate Real 311 Data**: Download NYC 311 noise complaint data and place in `data/raw/311_noise.csv`
2. **Spatial Analysis**: Perform spatial joins to match liquor licenses with nearby noise complaints
3. **Advanced Features**: 
   - Time-based features (day of week, time of day for complaints)
   - Density features (number of licenses in area)
   - Demographic data integration
4. **Model Optimization**: Hyperparameter tuning, feature selection, ensemble methods
5. **Deployment**: Create API or web interface for predictions
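
The time-based features mentioned in step 3 could be derived from complaint timestamps roughly as follows. This is a minimal sketch: the column name `Created Date`, the sample timestamps, and the late-night window are assumptions for illustration and should be checked against the actual 311 export.

```python
import pandas as pd

# Hypothetical complaint timestamps; 'Created Date' is an assumed column name
complaints = pd.DataFrame({
    'Created Date': ['2024-01-05 23:30:00', '2024-01-06 02:15:00', '2024-01-08 14:00:00']
})
created = pd.to_datetime(complaints['Created Date'], errors='coerce')

complaints['Day_Of_Week'] = created.dt.dayofweek  # Monday=0 ... Sunday=6
complaints['Hour'] = created.dt.hour
complaints['Is_Weekend'] = (created.dt.dayofweek >= 5).astype(int)
# Late night is defined here as 22:00-03:59 (an arbitrary choice)
complaints['Is_Late_Night'] = ((created.dt.hour >= 22) | (created.dt.hour < 4)).astype(int)
print(complaints)
```

These per-complaint flags could then be aggregated per license (for example, the share of nearby complaints filed late at night) after the spatial join.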

### Model Performance:
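- Classification accuracy: printed as "Test Accuracy" by the evaluation cell in Section 7
- Regression R²: printed as "Test R²" by the evaluation cell in Section 8
- Both values are computed on synthetic targets, so they measure how well the models recover the simulation, not real noise patterns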

The pipeline is now ready to be adapted with real noise complaint data for a complete analysis.