
This dataset (CHAIN data on rough sleeping in London) is rich enough to meet the assessment requirements. Because the assessment asks for a **code notebook**, this response provides the **structure, code, and written justifications** you can use to build one.

Here are three motivated hypotheses, the code to test them, and the normative evaluation required by the assessment.

### **Part 1: The Code Notebook Content**

You can copy the code blocks below into a Jupyter Notebook.

#### **Setup and Data Cleaning**

First, we need to prepare the data. The dataset mixes aggregate rows (GLA Total) and specific locations (Heathrow) with Borough data. We need to isolate the Boroughs for accurate testing.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (assuming the file is locally saved as 'rough_sleeping.csv')
df = pd.read_csv('rough_sleeping.csv')

# 1. CLEANING
# Filter to keep only the 32 boroughs + City of London.
# We exclude "Greater London Authority" (Total), "Bus route", "Tube line", "Heathrow".
# 'GSS Code' is kept as an identifier column for the melt below; it is not used in the tests.
exclude_rows = ['Greater London Authority', 'Bus route', 'Tube line', 'Heathrow']
borough_df = df[~df['Area'].isin(exclude_rows)].copy()

# Melt the dataframe to make it long-form (easier for stats)
# We treat the Quarters as a time variable.
borough_long = borough_df.melt(id_vars=['Area', 'GSS Code'],
                               var_name='Quarter',
                               value_name='Count')

# Convert 'Quarter' to a comparable numeric time index for regression
# (e.g., 2023-24 Q3 = 0, 2023-24 Q4 = 1, etc.)
# Note: this relies on the quarter columns already being in chronological order in the CSV.
unique_quarters = borough_long['Quarter'].unique()
quarter_map = {q: i for i, q in enumerate(unique_quarters)}
borough_long['Time_Index'] = borough_long['Quarter'].map(quarter_map)

print(f"Data ready. Observations: {len(borough_long)}")
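
Before running any test, it may help to confirm that the cleaning step behaves as intended. The sketch below uses a small made-up table that only imitates the assumed wide layout (an `Area` column, a `GSS Code` column, and one column per quarter); the values, the placeholder codes, and the `"-"` marker for a missing count are all assumptions for illustration, not taken from the real CSV.

```python
import pandas as pd

# Hypothetical toy table imitating the assumed wide layout (all values are made up).
toy = pd.DataFrame({
    "Area": ["Camden", "Sutton", "Heathrow", "Greater London Authority"],
    "GSS Code": ["code-1", "code-2", None, "code-3"],  # placeholder codes
    "2023-24 Q3": ["210", "12", "30", "4000"],
    "2023-24 Q4": ["198", "-", "28", "3900"],  # "-" stands in for a missing value
})

exclude_rows = ["Greater London Authority", "Bus route", "Tube line", "Heathrow"]
toy_boroughs = toy[~toy["Area"].isin(exclude_rows)]

toy_long = toy_boroughs.melt(id_vars=["Area", "GSS Code"],
                             var_name="Quarter", value_name="Count")

# Coerce counts to numbers; anything that cannot be parsed becomes NaN.
toy_long["Count"] = pd.to_numeric(toy_long["Count"], errors="coerce")

n_areas = toy_long["Area"].nunique()
n_missing = int(toy_long["Count"].isna().sum())
print(f"Areas kept: {n_areas}, missing counts: {n_missing}")
```

On the real data, the same two checks (number of distinct areas, which should be 33, and number of missing counts) are a quick way to catch a misspelled exclusion label or a non-numeric cell before it silently affects the statistics.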

-----

#### **Hypothesis 1: The "Inner City" Concentration Effect**

**Hypothesis:** *Rough sleeping counts are statistically significantly higher in Inner London boroughs compared to Outer London boroughs.*

**Reflection (Motivation):**

> *Why this hypothesis?* Urban geography literature consistently suggests that homelessness is spatially concentrated in city centers. This is often due to the centralization of support services (soup kitchens, shelters), the "anonymity" of crowds, and higher footfall for begging. Testing this verifies if London follows the standard urban model or if the housing crisis has pushed rough sleeping equally into the suburbs.

**The Code:**

In [None]:
# Define Inner London Boroughs (ONS statistical definition).
# The names must match the spelling used in the 'Area' column exactly (e.g. '&' vs 'and').
inner_london = [
    'Camden', 'City of London', 'Hackney', 'Hammersmith & Fulham', 'Haringey',
    'Islington', 'Kensington & Chelsea', 'Lambeth', 'Lewisham', 'Newham',
    'Southwark', 'Tower Hamlets', 'Wandsworth', 'Westminster'
]

# Create a binary column
borough_long['Location_Type'] = borough_long['Area'].apply(
    lambda x: 'Inner' if x in inner_london else 'Outer'
)

# Extract the two groups
inner_counts = borough_long[borough_long['Location_Type'] == 'Inner']['Count']
outer_counts = borough_long[borough_long['Location_Type'] == 'Outer']['Count']

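# Conduct Welch's t-test (equal_var=False does not assume equal variances).
# alternative='greater' matches the directional hypothesis (Inner > Outer); needs SciPy >= 1.6.
t_stat, p_val = stats.ttest_ind(inner_counts, outer_counts,
                                equal_var=False, alternative='greater')

print("Hypothesis 1 Results (Inner vs Outer):")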
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4e}")
print(f"Inner London Mean: {inner_counts.mean():.2f}")
print(f"Outer London Mean: {outer_counts.mean():.2f}")
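
One caveat with the test above is that each borough contributes several quarters, so the observations are not independent. A possible robustness check, sketched here on synthetic data, is to average each borough over time first and then compare the two groups with a one-sided Mann-Whitney U test, which does not assume normality. The `demo` frame and its values are invented stand-ins for `borough_long`; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic long-form data standing in for borough_long (values are made up).
rng = np.random.default_rng(0)
rows = []
for i in range(10):
    loc = "Inner" if i < 5 else "Outer"
    base = 100 if loc == "Inner" else 20
    for q in range(4):  # four quarters per borough
        rows.append({"Area": f"Borough {i}", "Location_Type": loc,
                     "Count": base + int(rng.integers(0, 10))})
demo = pd.DataFrame(rows)

# Collapse to one value per borough: its mean count across quarters.
per_borough = (demo.groupby(["Area", "Location_Type"])["Count"]
                   .mean()
                   .reset_index())
inner = per_borough.loc[per_borough["Location_Type"] == "Inner", "Count"]
outer = per_borough.loc[per_borough["Location_Type"] == "Outer", "Count"]

# One-sided test: are Inner borough averages stochastically greater than Outer ones?
u_stat, p_mw = stats.mannwhitneyu(inner, outer, alternative="greater")
print(f"U = {u_stat:.1f}, one-sided p = {p_mw:.4f}")
```

If this borough-level test and the quarter-level t-test agree, the conclusion is less likely to be an artefact of counting the same borough many times.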

-----

#### **Hypothesis 2: Seasonal Variation (The "Winter Shelter" Effect)**

**Hypothesis:** *There is a statistically significant difference in rough sleeping counts between Quarter 3 (Oct-Dec) and Quarter 2 (July-Sept).*

**Reflection (Motivation):**

> *Why this hypothesis?* This hypothesis draws on the "Severe Weather Emergency Protocol" (SWEP). In colder months (Q3/Q4), local authorities typically open emergency shelter provision when temperatures are forecast to fall to freezing; SWEP is widely followed guidance rather than a statutory duty. This might paradoxically cause *street counts* to drop in winter compared to summer, not because homelessness has decreased, but because more people are temporarily indoors. A naive expectation would be that winter counts are higher due to hardship; a motivated test looks for the administrative effect of shelter openings.

**The Code:**

In [None]:
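# Filter for specific quarters to compare seasonality
# Quarters follow the financial year: Q2 = July-Sept (Summer-ish), Q3 = Oct-Dec (Winter start)
# All years in the file are pooled, so every Q2 is compared with every Q3.
q2_data = borough_long[borough_long['Quarter'].str.contains('Q2')]['Count']
q3_data = borough_long[borough_long['Quarter'].str.contains('Q3')]['Count']

# Conduct Welch's t-test (two-sided, since the hypothesis is non-directional);
# equal_var=False keeps this consistent with Hypothesis 1.
t_stat_season, p_val_season = stats.ttest_ind(q2_data, q3_data, equal_var=False)

print("Hypothesis 2 Results (Summer Q2 vs Winter Q3):")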
print(f"P-value: {p_val_season:.4f}")
print(f"Q2 (Summer) Mean: {q2_data.mean():.2f}")
print(f"Q3 (Winter) Mean: {q3_data.mean():.2f}")
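
Because the same boroughs appear in both quarters, a paired comparison is arguably a better fit than an independent-samples test: it removes the large between-borough differences and looks only at the within-borough change. The following is a minimal sketch on made-up numbers; the `demo` frame and the quarter labels are assumptions, and on the real data Q2 and Q3 should be taken from the same financial year.

```python
import pandas as pd
from scipy import stats

# Synthetic long-form data standing in for borough_long (values are made up).
demo = pd.DataFrame({
    "Area": ["A", "B", "C", "D"] * 2,
    "Quarter": ["2024-25 Q2"] * 4 + ["2024-25 Q3"] * 4,
    "Count": [50, 12, 30, 8, 44, 10, 27, 7],
})

# One row per borough, one column per quarter, so the pairs line up by borough.
wide = demo.pivot(index="Area", columns="Quarter", values="Count").dropna()

# Paired t-test on the within-borough Q2 vs Q3 values.
t_paired, p_paired = stats.ttest_rel(wide["2024-25 Q2"], wide["2024-25 Q3"])
mean_diff = (wide["2024-25 Q2"] - wide["2024-25 Q3"]).mean()
print(f"Mean Q2 - Q3 difference: {mean_diff:.2f}, p = {p_paired:.4f}")
```

A positive mean difference would be in line with the winter shelter explanation, although a single pair of quarters cannot separate that effect from other seasonal factors.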

-----

#### **Hypothesis 3: Westminster as an Outlier (The "Magnet" Effect)**

**Hypothesis:** *The borough of Westminster represents a statistical outlier that skews the London-wide mean, acting as a distinct population from the rest of London.*

**Reflection (Motivation):**

> *Why this hypothesis?* Westminster typically records the highest rough sleeping figures in London. It is a transit hub (Victoria Coach Station) and a tourist center. In homelessness datasets, "average" figures are often misleading because one "super-borough" skews the data. Identifying Westminster as a statistical outlier is crucial for normative evaluation: policy that works for the "average" London borough might fail in Westminster, and vice versa.

**The Code:**

In [None]:
# Calculate Z-scores for the most recent quarter (2025-26 Q2)
latest_q = '2025-26 Q2'
current_data = borough_df[['Area', latest_q]].copy()
# nan_policy='omit' stops a single missing value from turning every Z-score into NaN
current_data['Z_Score'] = stats.zscore(current_data[latest_q], nan_policy='omit')

# Identify outliers (Z-score > 3 is a common rule-of-thumb cutoff)
outliers = current_data[current_data['Z_Score'] > 3]

print(f"Hypothesis 3 Results (Outliers in {latest_q}):")
print(outliers)
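
A known weakness of the plain Z-score is that an extreme value inflates the mean and standard deviation used to judge it, which can mask the very outlier being tested. One alternative, sketched below on made-up counts, is the modified Z-score based on the median and the median absolute deviation (MAD), with the commonly used cutoff of 3.5. The `demo` frame and "Borough X" are hypothetical.

```python
import pandas as pd

# Synthetic single-quarter counts (made up); "Borough X" plays the extreme borough.
demo = pd.DataFrame({
    "Area": ["Borough A", "Borough B", "Borough C",
             "Borough D", "Borough E", "Borough X"],
    "Count": [20, 25, 30, 35, 40, 900],
})

# Plain Z-score for comparison (population SD, as scipy.stats.zscore uses by default).
plain_z = (demo["Count"] - demo["Count"].mean()) / demo["Count"].std(ddof=0)

# Robust centre and spread: median and median absolute deviation.
median = demo["Count"].median()
mad = (demo["Count"] - median).abs().median()

# Modified Z-score; 0.6745 rescales the MAD to be comparable to a normal SD.
demo["Modified_Z"] = 0.6745 * (demo["Count"] - median) / mad

robust_outliers = demo[demo["Modified_Z"].abs() > 3.5]
print(robust_outliers)
```

In this six-row toy the plain Z-score of the extreme row stays below 3 (roughly 2.24) even though the value is obviously unusual, while the modified Z-score flags it. With 33 areas the masking is weaker, but comparing both measures is a cheap check.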

-----

### **Part 2: Normative Factors & Bias Evaluation**

This is the second part of your assessment where you must critique the dataset itself. You can include this as markdown cells in your notebook.

#### **1. The "Visible vs. Hidden" Bias**

  * **The Problem:** This dataset counts people **"Seen Rough Sleeping."** It is a "flow" count of visible street homelessness.
  * **Literature Connection:** Academic literature distinguishes between "rough sleeping" and the broader definition of "statutory homelessness." This dataset conceals the "hidden homeless": people sofa-surfing, living in temporary accommodation (B&Bs), or squatting.
  * **Normative Implication:** If policy makers only use this dataset, they may allocate funding effectively to *street* outreach but completely underfund *prevention* for families in temporary housing, who are not captured here.

#### **2. The "Outreach Intensity" Bias**

  * **The Problem:** CHAIN data is generated by outreach teams finding people. Therefore, the data is a function of **observation effort**.
  * **Literature Connection:** Biases in administrative data often arise where resources are higher. Westminster has heavily funded outreach teams patrolling frequently; an outer borough like Sutton might have fewer patrols.
  * **Normative Implication:** A low number in an outer borough might not mean "no rough sleeping"—it might just mean "no one is looking." The dataset might be measuring *outreach activity* rather than *homelessness prevalence*.

#### **3. The "Transient Population" Misrepresentation**

  * **The Problem:** The dataset lists Heathrow, Bus Routes, and Tube Lines (at the bottom of your CSV) as separate categories alongside the boroughs, which complicates geographic attribution.
  * **Literature Connection:** Rough sleepers are mobile. A person counted on a "Bus Route" might also be counted in "Camden" the next day.
  * **Normative Implication:** There is a risk of double-counting or misallocation of responsibility. If a person sleeps on a night bus because it's safer than the street, they are mobile. Assigning them to a "Bus" category absolves the specific Borough they came from of the responsibility to house them.

### **Next Steps**

Would you like me to refine the Python code to include **visualizations** (like a boxplot for Inner vs Outer London) to make the notebook more "technically proficient" as requested?