# Session 1: Data Science Mindset + Pandas Foundations

## What Data Science Actually Is

Data science is not a library, a model, or a notebook.
It is a **decision-making discipline under uncertainty**.

A data scientist’s job is to:
- Translate a vague real-world problem into a precise question
- Decide what data is relevant and what is noise
- Make assumptions explicit instead of hiding them in code
- Test whether the data supports or contradicts those assumptions

If you jump to modeling, you are not doing data science.
You are guessing with math.

---

## The Core Mental Loop (Non-Negotiable)

Every real project follows this loop:

**Problem → Data → Assumptions → Cleaning → EDA → Features → Model → Evaluation → Iteration**

You do NOT start at “Model”.
If you do, every result is unreliable.

Key idea:
> Models do not create insight.  
> Decisions made *before* the model do.

---

## Machine Learning Is a Tool, Not the Goal

Machine learning exists to answer **one of two broad types of questions**:

### 1. Supervised Learning (Target Exists)
You already know what you want to predict.

- Example:
  - Input: house size, location, quality
  - Target: house price

If there is a target column in your dataset, you are in supervised learning.
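
A minimal sketch of what "having a target column" looks like in pandas. The three-row table is invented for illustration; only the column names follow the house-price example.

```python
import pandas as pd

# Hypothetical toy data: values are made up, not taken from a real dataset
houses = pd.DataFrame({
    "GrLivArea": [1200, 1850, 950],
    "OverallQual": [5, 8, 4],
    "SalePrice": [180000, 320000, 140000],
})

TARGET = "SalePrice"

# Supervised setup: the target is separated from the candidate inputs
X = houses.drop(columns=[TARGET])
y = houses[TARGET]

print(X.columns.tolist())  # candidate features
print(y.name)              # the target
```

Naming the target in one place (`TARGET`) keeps the split explicit instead of burying it in later code.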

---

### 2. Unsupervised Learning (No Target)
You do NOT know the answer in advance.

- Example:
  - Group houses into similar segments
  - Discover patterns without labels

Unsupervised learning does NOT mean “better” or “smarter”.
It means **more assumptions and weaker guarantees**.

---

## Problem Types (Chosen by the Target)

The **type of target** determines the problem type — not your preference.

### Regression
- Target is a number
- Example: price, temperature, revenue
- Question: “How much?”

### Classification
- Target is a category
- Example: cheap vs expensive, fraud vs not fraud
- Question: “Which class?”

### Clustering
- No target
- Output is structure, not truth
- Question: “What groups exist?”

You do NOT choose the model first.
You choose the **question** first.
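
The mapping from target to problem type can be sketched as a small helper. `suggest_problem_type` is a hypothetical function written for this note, and it is only a rough heuristic: a class label stored as 0/1 would be reported as regression, so the final call still belongs to you.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_problem_type(df, target=None):
    """Rough heuristic: look at the target column, not at your preferences."""
    if target is None:
        return "clustering"        # no target at all
    if is_numeric_dtype(df[target]):
        return "regression"        # target is a number
    return "classification"        # target is a category


# Invented example data
toy = pd.DataFrame({
    "price": [100.0, 250.0],
    "segment": ["cheap", "expensive"],
})

print(suggest_problem_type(toy, "price"))
print(suggest_problem_type(toy, "segment"))
print(suggest_problem_type(toy))
```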

---

## The First Question You Always Ask

Before touching pandas, sklearn, or charts, ask:

**“What is the target, and why does it matter?”**

If you cannot answer this:
- You cannot clean data correctly
- You cannot evaluate results
- You cannot explain outcomes

---

## Why Raw Data Is Always Wrong

Raw data is never ready because:
- Values are missing
- Types are incorrect
- Units are inconsistent
- Outliers exist
- Columns mix signal with noise
- Data reflects how it was collected, not reality

Every transformation you make encodes an assumption.

Good data scientists:
- Know what assumptions they are making
- Can explain them in plain language

Bad data scientists:
- Let libraries make assumptions for them

---

## Rules for This Session

- We will NOT build models
- We will NOT chase accuracy
- We will focus on understanding the data
- Every code step must answer a question

If you cannot explain *why* you ran a line of code,
that line should not exist.


In [23]:
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)

n = 1500  # enough rows to make shape/info/describe useful

neighborhoods = ["NAmes", "CollgCr", "OldTown", "Edwards", "Somerst", "NridgHt", "Sawyer", "BrkSide", "Crawfor", "Mitchel"]
kitchen_qual = ["Fa", "TA", "Gd", "Ex"]
sale_conditions = ["Normal", "Abnorml", "Family", "Partial"]

# Core drivers
overall_qual = rng.integers(1, 11, size=n)  # 1-10
year_built = rng.integers(1950, 2021, size=n)
gr_liv_area = np.clip(rng.normal(1500, 500, size=n), 400, 5000).round().astype(int)

# Add a handful of very large houses (outliers)
outlier_idx = rng.choice(n, size=12, replace=False)
gr_liv_area[outlier_idx] = rng.integers(4200, 6500, size=len(outlier_idx))

lot_area = np.clip(rng.lognormal(mean=np.log(8000), sigma=0.5, size=n), 1200, 60000).round().astype(int)
full_bath = np.clip((gr_liv_area / 900 + rng.normal(0, 0.5, size=n)).round(), 1, 4).astype(int)

garage_cars = np.clip((overall_qual / 4 + rng.normal(0, 0.6, size=n)).round(), 0, 4).astype(int)
fireplaces = np.clip((overall_qual / 5 + rng.normal(0, 0.7, size=n)).round(), 0, 3).astype(int)

central_air = rng.choice(["Y", "N"], size=n, p=[0.88, 0.12])
neighborhood = rng.choice(neighborhoods, size=n, p=[0.16, 0.14, 0.10, 0.10, 0.10, 0.10, 0.08, 0.07, 0.08, 0.07])
kitchen = rng.choice(kitchen_qual, size=n, p=[0.06, 0.54, 0.30, 0.10])
sale_condition = rng.choice(sale_conditions, size=n, p=[0.80, 0.08, 0.06, 0.06])

distance_km = np.clip(rng.normal(8, 4, size=n), 0.2, 25).round(2)

# Neighborhood multipliers (simulate location effect)
nb_mult = {
    "NridgHt": 1.35,
    "Somerst": 1.15,
    "CollgCr": 1.10,
    "Crawfor": 1.12,
    "NAmes": 1.00,
    "Mitchel": 0.95,
    "Sawyer": 0.92,
    "Edwards": 0.90,
    "BrkSide": 0.88,
    "OldTown": 0.85,
}
nb_factor = np.array([nb_mult[x] for x in neighborhood])

# Kitchen quality multipliers
k_mult = {"Fa": 0.92, "TA": 1.00, "Gd": 1.08, "Ex": 1.18}
k_factor = np.array([k_mult[x] for x in kitchen])

# Central air multiplier
ca_factor = np.where(central_air == "Y", 1.03, 0.97)

# Distance penalty (further from city center often cheaper, but not always)
dist_factor = 1.0 - (distance_km / 100)

# Price generation (intentionally not perfect linearity)
base = 25000
price = (
    base
    + gr_liv_area * 95
    + lot_area * 0.7
    + overall_qual * 14000
    + full_bath * 8500
    + garage_cars * 11000
    + fireplaces * 4500
)

price = price * nb_factor * k_factor * ca_factor * dist_factor

# Add noise (market randomness)
noise = rng.normal(0, 25000, size=n)
sale_price = np.clip(price + noise, 45000, 850000).round(-1).astype(int)

# Introduce a few weird cheap/expensive anomalies for discussion
anom_idx = rng.choice(n, size=8, replace=False)
sale_price[anom_idx[:4]] = np.clip(sale_price[anom_idx[:4]] * 0.6, 45000, None).round(-1).astype(int)
sale_price[anom_idx[4:]] = np.clip(sale_price[anom_idx[4:]] * 1.4, None, 850000).round(-1).astype(int)

df = pd.DataFrame({
    "Id": np.arange(1, n + 1),
    "Neighborhood": neighborhood,
    "YearBuilt": year_built,
    "LotArea": lot_area,
    "OverallQual": overall_qual,
    "GrLivArea": gr_liv_area,
    "FullBath": full_bath,
    "GarageCars": garage_cars,
    "Fireplaces": fireplaces,
    "CentralAir": central_air,
    "KitchenQual": kitchen,
    "DistanceToCityCenterKm": distance_km,
    "SaleCondition": sale_condition,
    "SalePrice": sale_price,
})

# Inject missing values (so df.info() shows non-null counts and you can teach NaNs next session)
for col, frac in {
    "GarageCars": 0.03,
    "KitchenQual": 0.02,
    "DistanceToCityCenterKm": 0.015,
    "LotArea": 0.01,
}.items():
    idx = rng.choice(n, size=int(n * frac), replace=False)
    df.loc[idx, col] = np.nan

# A column with "messy" dtype potential (numeric as string) for later teaching if you want
# Uncomment if you want dtype issues in Session 2/3:
# df.loc[rng.choice(n, size=20, replace=False), "YearBuilt"] = df["YearBuilt"].astype(str)

df.to_csv("house_prices.csv", index=False)

---------------------------------------------------------------------------------------------------------------------------------------

Cell 4 — Import Libraries (Code)

In [24]:
# Import Libraries
import pandas as pd
import numpy as np

---------------------------------------------------------------------------------------------------------------------------------------

Cell 5 — Load Dataset (Code)

In [25]:
df = pd.read_csv("house_prices.csv")

# Always print shape immediately
df.shape

(1500, 14)

- Rows = observations (here, one row per house)
- Columns = variables (features, plus the identifier and the target)
- Shape tells you scale and feasibility

---------------------------------------------------------------------------------------------------------------------------------------

Cell 6 — First Look at Data (Code)

In [26]:
df.head()

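|   | Id | Neighborhood | YearBuilt | LotArea | OverallQual | GrLivArea | FullBath | GarageCars | Fireplaces | CentralAir | KitchenQual | DistanceToCityCenterKm | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Edwards | 1987 | 5405.0 | 1 | 1981 | 2 | 0.0 | 0 | Y | TA | 2.73 | Normal | 230100 |
| 1 | 2 | Mitchel | 1968 | 7515.0 | 8 | 1138 | 1 | 2.0 | 2 | Y | TA | 6.75 | Normal | 267240 |
| 2 | 3 | Crawfor | 1966 | 5897.0 | 7 | 1073 | 2 | 1.0 | 0 | Y | Gd | 13.85 | Normal | 283400 |
| 3 | 4 | Somerst | 1972 | 25261.0 | 5 | 1730 | 2 | 0.0 | 1 | Y | TA | 5.09 | Normal | 347970 |
| 4 | 5 | NAmes | 1992 | 7649.0 | 5 | 1994 | 2 | 1.0 | 2 | Y | TA | 5.61 | Normal | 307380 |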


- What looks like an identifier?
- What looks numeric vs categorical?
- Anything suspicious?
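
One way to start on the numeric vs categorical question is to split columns by dtype. This is a sketch on a three-row sample copied from the rows above. Treat the result as a first guess only, because dtype is not meaning.

```python
import pandas as pd

# Three rows and four columns taken from the head() output
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "Neighborhood": ["Edwards", "Mitchel", "Crawfor"],
    "OverallQual": [1, 8, 7],
    "CentralAir": ["Y", "Y", "Y"],
})

# Split by storage type
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric dtype:", numeric_cols)
print("non-numeric dtype:", other_cols)
```

`Id` and `OverallQual` both come back as numeric, yet one is a label and the other is a 1-10 rating. The dtype tells you how a column is stored, not what it represents.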

---------------------------------------------------------------------------------------------------------------------------------------

Cell 7 — Dataset Structure (Code)

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Id                      1500 non-null   int64  
 1   Neighborhood            1500 non-null   object 
 2   YearBuilt               1500 non-null   int64  
 3   LotArea                 1485 non-null   float64
 4   OverallQual             1500 non-null   int64  
 5   GrLivArea               1500 non-null   int64  
 6   FullBath                1500 non-null   int64  
 7   GarageCars              1455 non-null   float64
 8   Fireplaces              1500 non-null   int64  
 9   CentralAir              1500 non-null   object 
 10  KitchenQual             1470 non-null   object 
 11  DistanceToCityCenterKm  1478 non-null   float64
 12  SaleCondition           1500 non-null   object 
 13  SalePrice               1500 non-null   int64  
dtypes: float64(3), int64(7), object(4)
memor
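
The non-null counts above can be turned into explicit missing-value counts. A minimal sketch on an invented four-row sample (the gaps were placed by hand for illustration):

```python
import numpy as np
import pandas as pd

# Toy sample with hand-placed gaps; not the real distribution of NaNs
sample = pd.DataFrame({
    "LotArea": [5405.0, np.nan, 5897.0, 25261.0],
    "KitchenQual": ["TA", "TA", None, "TA"],
    "SalePrice": [230100, 267240, 283400, 347970],
})

missing = sample.isna().sum()                         # count per column
missing_pct = sample.isna().mean().mul(100).round(1)  # share per column, in %

summary = pd.DataFrame({"missing": missing, "pct": missing_pct})
print(summary[summary["missing"] > 0])
```

Run against the full dataset, the same two lines answer a question `info()` only hints at: which columns have gaps, and how large they are.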

---------------------------------------------------------------------------------------------------------------------------------------

Cell 8 — Quick Statistics (Code)

In [28]:
df.describe()

|   | Id | YearBuilt | LotArea | OverallQual | GrLivArea | FullBath | GarageCars | Fireplaces | DistanceToCityCenterKm | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1500.0 | 1500.0 | 1485.0 | 1500.0 | 1500.0 | 1500.0 | 1455.0 | 1500.0 | 1478.0 | 1500.0 |
| mean | 750.5 | 1984.830667 | 9108.237037 | 5.552 | 1515.328667 | 1.741333 | 1.403436 | 1.136667 | 7.969567 | 291028.9 |
| std | 433.157015 | 20.612648 | 4960.506376 | 2.874393 | 615.413323 | 0.726712 | 0.961495 | 0.877783 | 3.895349 | 95018.749226 |
| min | 1.0 | 1950.0 | 2023.0 | 1.0 | 400.0 | 1.0 | 0.0 | 0.0 | 0.2 | 72800.0 |
| 25% | 375.75 | 1967.0 | 5723.0 | 3.0 | 1149.75 | 1.0 | 1.0 | 0.0 | 5.25 | 223887.5 |
| 50% | 750.5 | 1984.0 | 8001.0 | 6.0 | 1504.5 | 2.0 | 1.0 | 1.0 | 7.83 | 283250.0 |
| 75% | 1125.25 | 2003.0 | 10975.0 | 8.0 | 1831.0 | 2.0 | 2.0 | 2.0 | 10.705 | 344205.0 |
| max | 1500.0 | 2020.0 | 44991.0 | 10.0 | 6325.0 | 4.0 | 4.0 | 3.0 | 19.4 | 850000.0 |


- Which columns are missing from this summary?
- Why didn’t the categorical columns appear?
- Does the mean price represent a real house?

Rule: never trust `describe()` alone.
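
Two of the questions above can be checked directly. The sketch below uses four invented rows: it asks `describe()` for the non-numeric columns explicitly, then compares the mean with the median.

```python
import numpy as np
import pandas as pd

# Invented rows: three ordinary sales and one very expensive one
sample = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "OldTown", "NridgHt"],
    "SalePrice": [200000, 220000, 150000, 850000],
})

# Default: numeric columns only, so Neighborhood is absent
numeric_summary = sample.describe()

# Non-numeric columns must be requested; they get count/unique/top/freq
categorical_summary = sample.describe(exclude=[np.number])

mean_price = sample["SalePrice"].mean()
median_price = sample["SalePrice"].median()

print(categorical_summary)
print(mean_price, median_price)
```

In this toy sample the mean is 355,000 while no house sold anywhere near that figure, and the median is 210,000. One extreme sale is enough to pull the mean away from every real observation.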

---------------------------------------------------------------------------------------------------------------------------------------

## Columns Are Not Just Columns

For EACH column, we must ask:
- What does this represent in the real world?
- What unit is it measured in?
- Is it an input, output, or identifier?

Identifiers should almost NEVER be features.
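
A rough way to shortlist identifier candidates is to look for columns where every value is unique. `looks_like_identifier` is a hypothetical helper and the sample values are invented. The check is deliberately naive, and the output shows why a human still has to decide.

```python
import pandas as pd

# Invented sample
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "OverallQual": [5, 8, 5, 7],
    "SalePrice": [230100, 267240, 283400, 347970],
})


def looks_like_identifier(series):
    """Naive heuristic: all values unique and none missing."""
    return bool(series.is_unique and series.notna().all())


candidates = [col for col in sample.columns if looks_like_identifier(sample[col])]
print(candidates)

# The decision itself is made explicitly, by name
model_inputs = sample.drop(columns=["Id", "SalePrice"])
print(model_inputs.columns.tolist())
```

`SalePrice` is flagged alongside `Id` because prices rarely repeat. Uniqueness only produces candidates; knowing what the column means in the real world settles it.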


---------------------------------------------------------------------------------------------------------------------------------------

Cell 10 — List Columns (Code)

In [29]:
df.columns

Index(['Id', 'Neighborhood', 'YearBuilt', 'LotArea', 'OverallQual',
       'GrLivArea', 'FullBath', 'GarageCars', 'Fireplaces', 'CentralAir',
       'KitchenQual', 'DistanceToCityCenterKm', 'SaleCondition', 'SalePrice'],
      dtype='object')

Exercise: mark each column as one of:

- Target
- Feature
- Identifier
- Unknown

---------------------------------------------------------------------------------------------------------------------------------------

Cell 11 — Selecting Columns (Code)

In [30]:
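# Select a single column (returns a Series)
# Note: a notebook cell only displays its last expression, so this result is not shown
df["SalePrice"].head()

# Select multiple columns (returns a DataFrame); this is the output displayed below
df[["SalePrice", "GrLivArea", "OverallQual"]].head()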

|   | SalePrice | GrLivArea | OverallQual |
|---|---|---|---|
| 0 | 230100 | 1981 | 1 |
| 1 | 267240 | 1138 | 8 |
| 2 | 283400 | 1073 | 7 |
| 3 | 347970 | 1730 | 5 |
| 4 | 307380 | 1994 | 5 |


Selecting columns explicitly by name prevents accidental leakage: a column derived from the target cannot slip into the inputs unnoticed.

---------------------------------------------------------------------------------------------------------------------------------------

Cell 12 — Row Filtering (Code)

In [31]:
# Example: houses with very large living area
df[df["GrLivArea"] > 3000].head()

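|   | Id | Neighborhood | YearBuilt | LotArea | OverallQual | GrLivArea | FullBath | GarageCars | Fireplaces | CentralAir | KitchenQual | DistanceToCityCenterKm | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | Sawyer | 2014 | 7255.0 | 1 | 6325 | 4 | 0.0 | 0 | Y | TA | NaN | Normal | 631420 |
| 96 | 97 | Mitchel | 1996 | 3647.0 | 3 | 5747 | 4 | 1.0 | 0 | Y | Gd | 9.14 | Abnorml | 626940 |
| 345 | 346 | Somerst | 1993 | 6841.0 | 3 | 5094 | 4 | 1.0 | 1 | Y | TA | 3.99 | Normal | 679680 |
| 380 | 381 | Edwards | 1977 | 5863.0 | 1 | 4742 | 4 | 1.0 | 0 | Y | NaN | 13.48 | Normal | 538160 |
| 557 | 558 | Edwards | 2002 | 9539.0 | 9 | 5073 | 4 | 3.0 | 2 | Y | TA | 12.1 | Normal | 609470 |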


Ask:

- Are these real houses or recording errors?
- Would these influence price heavily?
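
The `> 3000` threshold above is a judgment call. One common convention for picking a threshold from the data itself is the 1.5 × IQR fence. The sketch below applies it to eight living-area values (five from the tables above, two invented). It is an illustration of that convention, not a rule this session prescribes.

```python
import pandas as pd

# Five values from the tables above plus two invented mid-range ones
living_area = pd.Series(
    [1138, 1073, 1730, 1994, 1981, 1500, 1450, 6325],
    name="GrLivArea",
)

q1 = living_area.quantile(0.25)
q3 = living_area.quantile(0.75)
iqr = q3 - q1

# Values above this fence are flagged for review, not deleted
upper_fence = q3 + 1.5 * iqr
flagged = living_area[living_area > upper_fence]

print(upper_fence)
print(flagged.tolist())
```

Flagging is where the work starts. Whether a 6,325 sq ft house is an error or simply a mansion cannot be decided from the number alone.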

---------------------------------------------------------------------------------------------------------------------------------------

Cell 13 — Creating New Columns (Code)

In [32]:
# Example: price per square foot
df["PricePerSqFt"] = df["SalePrice"] / df["GrLivArea"]

df[["SalePrice", "GrLivArea", "PricePerSqFt"]].head()


|   | SalePrice | GrLivArea | PricePerSqFt |
|---|---|---|---|
| 0 | 230100 | 1981 | 116.153458 |
| 1 | 267240 | 1138 | 234.833040 |
| 2 | 283400 | 1073 | 264.119292 |
| 3 | 347970 | 1730 | 201.138728 |
| 4 | 307380 | 1994 | 154.152457 |


Teach:

- Features encode assumptions
- This one assumes price scales proportionally with living area, which might be wrong
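
The assumption can be tested rather than debated. In the sketch below, prices are invented so that each house costs a fixed 50,000 plus 100 per square foot. Price is perfectly linear in area, yet price per square foot is not constant.

```python
import pandas as pd

# Invented pricing rule: fixed base cost + constant rate per square foot
toy = pd.DataFrame({"GrLivArea": [800, 1200, 1600, 2400, 3200]})
toy["SalePrice"] = 50_000 + 100 * toy["GrLivArea"]

# Same derived feature as in the cell above
toy["PricePerSqFt"] = toy["SalePrice"] / toy["GrLivArea"]

print(toy)
```

The ratio falls from 162.5 for the smallest house to about 115.6 for the largest, because the fixed base cost is spread over more area. A ratio feature treats that base cost as zero, and that is the assumption being encoded.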

---------------------------------------------------------------------------------------------------------------------------------------

Cell 14 — Renaming Columns (Code)

In [33]:
df = df.rename(columns={
    "GrLivArea": "LivingAreaSqFt",
    "SalePrice": "HousePrice"
})

df.columns


Index(['Id', 'Neighborhood', 'YearBuilt', 'LotArea', 'OverallQual',
       'LivingAreaSqFt', 'FullBath', 'GarageCars', 'Fireplaces', 'CentralAir',
       'KitchenQual', 'DistanceToCityCenterKm', 'SaleCondition', 'HousePrice',
       'PricePerSqFt'],
      dtype='object')

Rule:

- Clean names improve thinking
- Code clarity affects reasoning quality
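
One caveat when renaming: by default `rename` ignores keys that do not match any column, so a typo goes unnoticed. A minimal sketch using the `errors="raise"` option of `DataFrame.rename`; the misspelled name `GrLivAera` is deliberate.

```python
import pandas as pd

sample = pd.DataFrame({
    "GrLivArea": [1981, 1138],
    "SalePrice": [230100, 267240],
})

# Deliberate typo in the old name: nothing is renamed, nothing is reported
unchanged = sample.rename(columns={"GrLivAera": "LivingAreaSqFt"})
print(unchanged.columns.tolist())

# errors="raise" turns the silent no-op into a KeyError
try:
    sample.rename(columns={"GrLivAera": "LivingAreaSqFt"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(typo_caught)
```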

-----------------------------------------------------------------------------------------------------------------------------------------

Cell 15 — Why Raw Data Is Always Wrong (Markdown)

## Why Raw Data Is Always Wrong

Raw data problems:
- Missing values
- Incorrect types
- Outliers
- Measurement errors
- Hidden assumptions

If data were clean:
- Data scientists would not exist


-----------------------------------------------------------------------------------------------------------------------------------------

Cell 16 — Thinking Habits (Markdown)

## Thinking Habits to Enforce

Always ask:
- What is the target?
- What unit is each column?
- Which columns are inputs vs identifiers?

Never:
- Model before inspecting
- Trust summary stats blindly
- Assume missing values are random
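
Whether missing values are random can be probed by comparing missing rates across groups. The eight rows below are invented to make the pattern obvious; on real data the differences are usually subtler.

```python
import numpy as np
import pandas as pd

# Invented data: garage information is missing far more often in one area
toy = pd.DataFrame({
    "Neighborhood": ["OldTown"] * 4 + ["NridgHt"] * 4,
    "GarageCars": [np.nan, np.nan, 1.0, np.nan, 2.0, 3.0, 2.0, np.nan],
})

# Share of missing GarageCars values per neighborhood
missing_rate = toy["GarageCars"].isna().groupby(toy["Neighborhood"]).mean()
print(missing_rate)
```

If the rates differ sharply between groups, dropping or filling the missing rows changes which houses the dataset represents.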

## Exercise

1. What problem could this dataset solve?
2. What is the target?
3. Which columns should NOT be used as features?
4. What assumptions did we already make today?