
# 🐼 Data Manipulation with Pandas - Jupyter Notes

## 📌 Introduction

**pandas** is a powerful Python library for **data manipulation and analysis**.  
It builds upon:
- **NumPy** (for efficient numerical computation)
- **Matplotlib** (for visualization)

pandas is especially good at handling **rectangular/tabular data**, which it stores in **DataFrames**.

---

## 📊 What is a DataFrame?

A **DataFrame** is:
- A table-like data structure (rows × columns)
- Rows = Observations (e.g., each dog)
- Columns = Variables (e.g., breed, height)

Every column contains values of the same data type, but different columns may have different types.
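
A minimal sketch of the "one type per column" rule. The three-row `pets` table is made up for illustration and is not part of the notes' dataset.

```python
import pandas as pd

# Hypothetical three-row table, used only for illustration
pets = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],   # text
    "height_cm": [56, 43, 46],              # integers
    "weight_kg": [24.5, 23.0, 22.1],        # floats
})

# One dtype per column; different columns can have different dtypes
print(pets.dtypes)

# A single column is a Series with exactly one dtype
heights = pets["height_cm"]
print(heights.dtype)
```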

---

## 🐕 Sample DataFrame (Dogs Dataset)

```python
import pandas as pd

# Sample data
data = {
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11", "2017-01-20", "2015-04-20", "2018-02-27"]
}

dogs = pd.DataFrame(data)
```

---

## 🔍 Exploring a DataFrame

### 1. `.head()` – First few rows

```python
print(dogs.head())
```

> Returns first 5 rows by default (helpful for large datasets).

📤 **Output:**

```
      name       breed  color  height_cm  weight_kg date_of_birth
0    Bella    Labrador  Brown         56         24    2013-07-01
1  Charlie      Poodle  Black         43         24    2016-09-16
2     Lucy   Chow Chow  Brown         46         24    2014-08-25
3   Cooper   Schnauzer   Gray         49         17    2011-12-11
4      Max    Labrador  Black         59         29    2017-01-20
```
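
`.head()` also accepts a row count, and `.tail()` does the same from the end of the table. A small sketch on a made-up ten-row frame (`numbers` is a hypothetical name):

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})  # hypothetical 10-row frame

first_three = numbers.head(3)   # rows 0, 1, 2
last_two = numbers.tail(2)      # rows 8, 9

print(first_three)
print(last_two)
```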

---

### 2. `.info()` – Summary of the DataFrame

```python
dogs.info()  # prints the summary directly; wrapping it in print() would also print "None"
```

> Shows column names, data types, non-null counts.

📤 **Output:**

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           7 non-null      object
 1   breed          7 non-null      object
 2   color          7 non-null      object
 3   height_cm      7 non-null      int64 
 4   weight_kg      7 non-null      int64 
 5   date_of_birth  7 non-null      object
dtypes: int64(2), object(4)
```

🧠 **Note:** Use `.info()` to check for missing values and data types quickly.
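
If you want the missing-value counts as numbers rather than reading them off the `.info()` printout, one common approach is `.isna().sum()`. The `sample` frame below is hypothetical and has one missing weight on purpose.

```python
import pandas as pd

# Hypothetical frame with one missing weight (None is stored as NaN)
sample = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "weight_kg": [24.0, None, 24.0],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```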

---

### 3. `.shape` – Dimensions

```python
print(dogs.shape)
```

📤 **Output:**

```
(7, 6)
```

> 7 rows, 6 columns → Tuple format `(rows, columns)`
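
Because `.shape` is a plain tuple, it can be unpacked into two variables. A short sketch on a made-up frame (`grid` is a hypothetical name):

```python
import pandas as pd

grid = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # hypothetical 3x2 frame

n_rows, n_cols = grid.shape   # an attribute, so no parentheses
print(n_rows, n_cols)
print(len(grid))              # len() gives the row count only
```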

---

### 4. `.describe()` – Summary statistics for numerical columns

```python
print(dogs.describe())
```

📤 **Output:**

```
       height_cm  weight_kg
count    7.000000   7.000000
mean    49.714286  27.714286
std     17.960274  22.216468
min     18.000000   2.000000
25%     44.500000  20.500000
50%     49.000000  24.000000
75%     57.500000  26.500000
max     77.000000  74.000000
```

> Good for a quick overview of numeric variables (mean, median, range).
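
The `50%` row is the median, and the range can be read from `max` and `min`. A quick check using the `height_cm` values from the dogs table above:

```python
import pandas as pd

heights = pd.Series([56, 43, 46, 49, 59, 18, 77], name="height_cm")

summary = heights.describe()   # a Series indexed by statistic name

print(summary["50%"], heights.median())   # both are the median
print(summary["max"] - summary["min"])    # range of the column
```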

---

### 5. `.values` – Access underlying NumPy array

```python
dogs.values  # as the last line of a notebook cell, this displays the array's repr
```

📤 **Output:**

```
array([['Bella', 'Labrador', 'Brown', 56, 24, '2013-07-01'],
       ['Charlie', 'Poodle', 'Black', 43, 24, '2016-09-16'],
       ...
       ['Bernie', 'St. Bernard', 'White', 77, 74, '2018-02-27']], dtype=object)
```

> 2D NumPy array of the full data. To work with a single column, select it by label (for example `dogs["height_cm"]`) rather than slicing this array.
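
Recent pandas documentation recommends `.to_numpy()` as the method form of `.values`. The sketch below uses a made-up two-row frame (`sizes`) to contrast the raw array with label-based column access.

```python
import pandas as pd

sizes = pd.DataFrame({"height_cm": [56, 43], "weight_kg": [24, 24]})  # hypothetical

as_array = sizes.to_numpy()    # 2D NumPy array, shape (rows, columns)
heights = sizes["height_cm"]   # column access by label keeps the name and index

print(as_array.shape)
print(heights.tolist())
```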

---

### 6. `.columns` and `.index` – Labels

```python
print(dogs.columns)
print(dogs.index)
```

📤 **Output:**

```
Index(['name', 'breed', 'color', 'height_cm', 'weight_kg', 'date_of_birth'], dtype='object')
RangeIndex(start=0, stop=7, step=1)
```

> - `.columns` gives the column names
> - `.index` gives the row labels (can be numbers or names)
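
To illustrate row labels that are names rather than numbers, here is a sketch using `.set_index()`, which is not covered in these notes. The `dogs_small` frame is a hypothetical two-row cut of the dogs table.

```python
import pandas as pd

dogs_small = pd.DataFrame({
    "name": ["Bella", "Charlie"],
    "height_cm": [56, 43],
})

# set_index() returns a new DataFrame whose row labels come from "name"
by_name = dogs_small.set_index("name")

print(by_name.index.tolist())
print(by_name.columns.tolist())
```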

---

## 🧠 pandas Philosophy

* Inspired by the Zen of Python: **"There should be one-- and preferably only one --obvious way to do it."**
* But pandas often gives you **multiple options** to do the same task.
* This makes it **powerful** but can be **confusing**.
* In this course/guide, we stick to the **simplest & most common methods**.

🧰 Think of pandas like a **Swiss Army Knife** — many tools, use what fits best.

---

## ✅ Recap

| Method        | Purpose                        |
| ------------- | ------------------------------ |
| `.head()`     | View first few rows            |
| `.info()`     | Summary: columns, types, nulls |
| `.shape`      | Dimensions (rows, columns)     |
| `.describe()` | Summary stats for numeric data |
| `.values`     | Raw data as a NumPy array      |
| `.columns`    | Column labels                  |
| `.index`      | Row labels                     |

---


In [None]:
# 📘 Exercise: Inspecting a DataFrame

# 🧠 Goal: When working with a new DataFrame, first explore it using these tools:
# - .head(): Shows the first few rows (default = 5)
# - .info(): Gives column names, data types, and null counts
# - .shape: Tells you the number of rows and columns
# - .describe(): Gives summary statistics for numeric columns

# 📋 Dataset Description:
# The 'homelessness' DataFrame contains data on homelessness in each U.S. state for 2018.
# Columns:
# - 'region': U.S. region name
# - 'state': State name
# - 'individuals': Homeless people NOT in families
# - 'family_members': Homeless people in families
# - 'state_pop': Total population of the state

# 🔍 Step-by-step inspection:

# 1️⃣ Print the first 5 rows
print("📍 Head of the DataFrame:")
print(homelessness.head())

# Output:
#                    region       state  individuals  family_members  state_pop
# 0  East South Central     Alabama       2570.0           864.0    4887681
# 1             Pacific      Alaska       1434.0           582.0     735139
# 2            Mountain     Arizona       7259.0          2606.0    7158024
# 3  West South Central    Arkansas       2280.0           432.0    3009733
# 4             Pacific  California     109008.0         20964.0   39461588

# 2️⃣ Show full info of the DataFrame
print("\n📍 Info about the DataFrame:")
print(homelessness.info())

# Output:
# <class 'pandas.core.frame.DataFrame'>
# Index: 51 entries, 0 to 50
# Data columns (total 5 columns):
#  #   Column          Non-Null Count  Dtype  
# ---  ------          --------------  -----  
#  0   region          51 non-null     object 
#  1   state           51 non-null     object 
#  2   individuals     51 non-null     float64
#  3   family_members  51 non-null     float64
#  4   state_pop       51 non-null     int64  
# dtypes: float64(2), int64(1), object(2)
# memory usage: 2.0+ KB
# None

# 3️⃣ Get the shape (rows, columns)
print("\n📍 Shape of the DataFrame:")
print(homelessness.shape)

# Output:
# (51, 5)

# 4️⃣ Get summary statistics for numeric columns
print("\n📍 Description of numeric columns:")
print(homelessness.describe())

# Output:
#          individuals  family_members     state_pop
# count      51.000000       51.000000  5.100000e+01
# mean     7225.784314     3504.882353  6.405637e+06
# std     15991.025083     7805.411811  7.327258e+06
# min       434.000000       75.000000  5.776010e+05
# 25%      1446.500000      592.000000  1.777414e+06
# 50%      3082.000000     1482.000000  4.461153e+06
# 75%      6781.500000     3196.000000  7.340946e+06
# max    109008.000000    52070.000000  3.946159e+07

# ✅ Summary:
# This exploration step helps you understand:
# - What's in your dataset
# - What type of data each column has
# - If there are missing values
# - General trends in numeric columns



In [None]:
# 🧪 Exercise: Parts of a DataFrame

# 🎯 Objective:
# To better understand DataFrame objects by exploring their key components:
# - .values: A 2D NumPy array of the actual data
# - .columns: The column names of the DataFrame
# - .index: The row labels (typically default integers)

# 📌 homelessness DataFrame is already available

# ✅ Step 1: Import pandas
import pandas as pd

# ✅ Step 2: Print the values (as a NumPy 2D array)
print("🔹 Data Values (.values):")
print(homelessness.values)

# ✅ Step 3: Print the column names
print("\n🔹 Column Names (.columns):")
print(homelessness.columns)

# ✅ Step 4: Print the row index
print("\n🔹 Row Index (.index):")
print(homelessness.index)

# 🧾 Expected Output:

# 🔹 Data Values (.values):
# [['East South Central' 'Alabama' 2570.0 864.0 4887681]
#  ['Pacific' 'Alaska' 1434.0 582.0 735139]
#  ['Mountain' 'Arizona' 7259.0 2606.0 7158024]
#  ['West South Central' 'Arkansas' 2280.0 432.0 3009733]
#  ['Pacific' 'California' 109008.0 20964.0 39461588]
#  ['Mountain' 'Colorado' 7607.0 3250.0 5691287]
#  ['New England' 'Connecticut' 2280.0 1696.0 3571520]
#  ['South Atlantic' 'Delaware' 708.0 374.0 965479]
#  ['South Atlantic' 'District of Columbia' 3770.0 3134.0 701547]
#  ['South Atlantic' 'Florida' 21443.0 9587.0 21244317]
#  ['South Atlantic' 'Georgia' 6943.0 2556.0 10511131]
#  ['Pacific' 'Hawaii' 4131.0 2399.0 1420593]
#  ['Mountain' 'Idaho' 1297.0 715.0 1750536]
#  ['East North Central' 'Illinois' 6752.0 3891.0 12723071]
#  ['East North Central' 'Indiana' 3776.0 1482.0 6695497]
#  ['West North Central' 'Iowa' 1711.0 1038.0 3148618]
#  ['West North Central' 'Kansas' 1443.0 773.0 2911359]
#  ['East South Central' 'Kentucky' 2735.0 953.0 4461153]
#  ['West South Central' 'Louisiana' 2540.0 519.0 4659690]
#  ['New England' 'Maine' 1450.0 1066.0 1339057]
#  ['South Atlantic' 'Maryland' 4914.0 2230.0 6035802]
#  ['New England' 'Massachusetts' 6811.0 13257.0 6882635]
#  ['East North Central' 'Michigan' 5209.0 3142.0 9984072]
#  ['West North Central' 'Minnesota' 3993.0 3250.0 5606249]
#  ['East South Central' 'Mississippi' 1024.0 328.0 2981020]
#  ['West North Central' 'Missouri' 3776.0 2107.0 6121623]
#  ['Mountain' 'Montana' 983.0 422.0 1060665]
#  ['West North Central' 'Nebraska' 1745.0 676.0 1925614]
#  ['Mountain' 'Nevada' 7058.0 486.0 3027341]
#  ['New England' 'New Hampshire' 835.0 615.0 1353465]
#  ['Mid-Atlantic' 'New Jersey' 6048.0 3350.0 8886025]
#  ['Mountain' 'New Mexico' 1949.0 602.0 2092741]
#  ['Mid-Atlantic' 'New York' 39827.0 52070.0 19530351]
#  ['South Atlantic' 'North Carolina' 6451.0 2817.0 10381615]
#  ['West North Central' 'North Dakota' 467.0 75.0 758080]
#  ['East North Central' 'Ohio' 6929.0 3320.0 11676341]
#  ['West South Central' 'Oklahoma' 2823.0 1048.0 3940235]
#  ['Pacific' 'Oregon' 11139.0 3337.0 4181886]
#  ['Mid-Atlantic' 'Pennsylvania' 8163.0 5349.0 12800922]
#  ['New England' 'Rhode Island' 747.0 354.0 1058287]
#  ['South Atlantic' 'South Carolina' 3082.0 851.0 5084156]
#  ['West North Central' 'South Dakota' 836.0 323.0 878698]
#  ['East South Central' 'Tennessee' 6139.0 1744.0 6771631]
#  ['West South Central' 'Texas' 19199.0 6111.0 28628666]
#  ['Mountain' 'Utah' 1904.0 972.0 3153550]
#  ['New England' 'Vermont' 780.0 511.0 624358]
#  ['South Atlantic' 'Virginia' 3928.0 2047.0 8501286]
#  ['Pacific' 'Washington' 16424.0 5880.0 7523869]
#  ['South Atlantic' 'West Virginia' 1021.0 222.0 1804291]
#  ['East North Central' 'Wisconsin' 2740.0 2167.0 5807406]
#  ['Mountain' 'Wyoming' 434.0 205.0 577601]]

# 🔹 Column Names (.columns):
# Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

# 🔹 Row Index (.index):
# RangeIndex(start=0, stop=51, step=1)


## 🐼 Data Manipulation with Pandas – Sorting and Subsetting

This section covers essential techniques for exploring and extracting meaningful parts of a DataFrame using **sorting** and **subsetting**. These are some of the most important and frequently used operations in pandas.

---

### 🔢 Sorting a DataFrame

- To sort rows, use `.sort_values()` and pass the column name.
- Use `ascending=False` to sort in descending order.
- You can also sort by **multiple columns**, specifying both the column list and a list of booleans for ascending/descending.

```python
# Sort by weight_kg (lightest to heaviest)
dogs.sort_values("weight_kg")

# Sort from heaviest to lightest
dogs.sort_values("weight_kg", ascending=False)

# Sort by multiple columns: first by weight, then height
dogs.sort_values(["weight_kg", "height_cm"])

# Sort by weight ascending, height descending
dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
```

💡 *Sorting brings the rows you care about (for example, the lightest or tallest dogs) to the top of the table. `.sort_values()` returns a new DataFrame and leaves `dogs` unchanged.*
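
Sorting pairs naturally with `.head()` to answer "top N" questions. A minimal sketch on a hypothetical five-row cut of the dogs table (`dogs_small`):

```python
import pandas as pd

dogs_small = pd.DataFrame({
    "name": ["Bella", "Cooper", "Max", "Stella", "Bernie"],
    "weight_kg": [24, 17, 29, 2, 74],
})

# Three heaviest dogs: sort descending, then keep the first three rows
heaviest = dogs_small.sort_values("weight_kg", ascending=False).head(3)
print(heaviest)

# The original frame keeps its row order
print(dogs_small["name"].tolist())
```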

---

### 📑 Subsetting Columns

To view **only specific columns** of a DataFrame:

```python
# One column
dogs["name"]

# Multiple columns (use double square brackets)
dogs[["breed", "height_cm"]]

# Using a variable list of column names
cols_to_subset = ["breed", "height_cm"]
dogs[cols_to_subset]
```

📌 Output example:

| breed       | height\_cm |
| ----------- | ---------- |
| Labrador    | 56         |
| Poodle      | 43         |
| Chow Chow   | 46         |
| Schnauzer   | 49         |
| Labrador    | 59         |
| Chihuahua   | 18         |
| St. Bernard | 77         |

---

### 📉 Subsetting Rows Using Conditions

You can filter rows using **boolean conditions**:

```python
# Boolean Series: Which dogs are taller than 50 cm?
dogs["height_cm"] > 50
```

💡 *Returns a Series of True/False values.*

```python
# Subset DataFrame with the condition
dogs[dogs["height_cm"] > 50]
```

📌 Output:

| name   | breed       | color | height\_cm | weight\_kg | date\_of\_birth |
| ------ | ----------- | ----- | ---------- | ---------- | --------------- |
| Bella  | Labrador    | Brown | 56         | 24         | 2013-07-01      |
| Max    | Labrador    | Black | 59         | 29         | 2017-01-20      |
| Bernie | St. Bernard | White | 77         | 74         | 2018-02-27      |
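
Since `True` counts as 1, a boolean Series can also be summed to count matching rows without building the subset. A sketch using the `height_cm` values from the dogs table:

```python
import pandas as pd

heights = pd.Series([56, 43, 46, 49, 59, 18, 77], name="height_cm")

is_tall = heights > 50    # boolean Series, one value per row
print(is_tall.tolist())
print(is_tall.sum())      # number of dogs taller than 50 cm
```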

---

### 🔤 Subsetting Based on Text Data

```python
# Find only Labrador dogs
dogs[dogs["breed"] == "Labrador"]
```

📌 Output:

| name  | breed    | color | height\_cm | weight\_kg | date\_of\_birth |
| ----- | -------- | ----- | ---------- | ---------- | --------------- |
| Bella | Labrador | Brown | 56         | 24         | 2013-07-01      |
| Max   | Labrador | Black | 59         | 29         | 2017-01-20      |

---

### 📅 Subsetting Based on Dates

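Because `date_of_birth` is stored as text in ISO format (`"YYYY-MM-DD"`), you can compare it against a date string in the same format: ISO date strings sort in chronological order.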

```python
# Dogs born before 2015
dogs[dogs["date_of_birth"] < "2015-01-01"]
```

📌 Output:

| name   | breed     | color | height\_cm | weight\_kg | date\_of\_birth |
| ------ | --------- | ----- | ---------- | ---------- | --------------- |
| Bella  | Labrador  | Brown | 56         | 24         | 2013-07-01      |
| Lucy   | Chow Chow | Brown | 46         | 24         | 2014-08-25      |
| Cooper | Schnauzer | Gray  | 49         | 17         | 2011-12-11      |
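
String comparison only works while every value is in ISO format. A more robust option, not used in these notes, is to convert the column with `pd.to_datetime()` first. The `births` frame below is a hypothetical four-row cut of the dogs table.

```python
import pandas as pd

births = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11"],
})

# Convert the text column to real timestamps
births["date_of_birth"] = pd.to_datetime(births["date_of_birth"])

# Compare against a Timestamp instead of a string
cutoff = pd.Timestamp("2015-01-01")
before_2015 = births[births["date_of_birth"] < cutoff]
print(before_2015["name"].tolist())
```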

---

### ⚓ Subsetting with Multiple Conditions

You can combine multiple conditions using logical operators:

* `&` (AND)
* `|` (OR)

```python
# Dogs that are both Labradors AND Brown in color
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
dogs[is_lab & is_brown]

# Same in one line (remember parentheses!)
dogs[(dogs["breed"] == "Labrador") & (dogs["color"] == "Brown")]
```

📌 Output:

| name  | breed    | color | height\_cm | weight\_kg | date\_of\_birth |
| ----- | -------- | ----- | ---------- | ---------- | --------------- |
| Bella | Labrador | Brown | 56         | 24         | 2013-07-01      |
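
The same pattern works with `|` (OR), and with `~` (NOT), which these notes do not list but which follows the same rules. The parentheses matter in the one-line form because `&` and `|` bind more tightly than `==`. The `dogs_small` frame is a hypothetical four-row cut of the dogs table.

```python
import pandas as pd

dogs_small = pd.DataFrame({
    "name": ["Bella", "Charlie", "Max", "Stella"],
    "breed": ["Labrador", "Poodle", "Labrador", "Chihuahua"],
    "color": ["Brown", "Black", "Black", "Tan"],
})

is_lab = dogs_small["breed"] == "Labrador"
is_brown = dogs_small["color"] == "Brown"

lab_or_brown = dogs_small[is_lab | is_brown]   # rows matching either condition
not_lab = dogs_small[~is_lab]                  # rows where the condition is False

print(lab_or_brown["name"].tolist())
print(not_lab["name"].tolist())
```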

---

### ✅ Subsetting with `.isin()`

Used when checking if a column’s value is **in a list**.

```python
# Dogs that are either Black or Brown
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
dogs[is_black_or_brown]
```

📌 Output:

| name    | breed     | color | height\_cm | weight\_kg | date\_of\_birth |
| ------- | --------- | ----- | ---------- | ---------- | --------------- |
| Bella   | Labrador  | Brown | 56         | 24         | 2013-07-01      |
| Charlie | Poodle    | Black | 43         | 24         | 2016-09-16      |
| Lucy    | Chow Chow | Brown | 46         | 24         | 2014-08-25      |
| Max     | Labrador  | Black | 59         | 29         | 2017-01-20      |
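
`.isin()` is shorthand for a chain of `==` comparisons joined with `|`. A small sketch on a made-up `colors` Series shows that the two forms produce the same mask:

```python
import pandas as pd

colors = pd.Series(["Brown", "Black", "Gray", "Tan"])  # hypothetical values

with_isin = colors.isin(["Black", "Brown"])
with_or = (colors == "Black") | (colors == "Brown")

print(with_isin.tolist())
print(with_isin.equals(with_or))   # the two masks are identical
```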

---

### 🏁 Summary

| Concept                 | Method/Code Example                            |
| ----------------------- | ---------------------------------------------- |
| Sort DataFrame          | `df.sort_values("col")`                        |
| Subset 1 Column         | `df["col"]`                                    |
| Subset Multiple Columns | `df[["col1", "col2"]]` or `df[cols_to_subset]` |
| Row Condition (numeric) | `df[df["height_cm"] > 50]`                     |
| Row Condition (text)    | `df[df["breed"] == "Labrador"]`                |
| Row Condition (date)    | `df[df["date_of_birth"] < "2015-01-01"]`       |
| Multiple Conditions     | `df[(cond1) & (cond2)]`                        |
| Categorical Filter      | `df[df["color"].isin(["Black", "Brown"])]`     |

🔍 These operations allow you to *quickly extract, clean, and understand data* using pandas.

---



In [None]:
# Sorting rows
# Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

# In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

# Sort on …          Syntax
# one column         df.sort_values("breed")
# multiple columns   df.sort_values(["breed", "weight_kg"])
#
# By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".

# homelessness is available and pandas is loaded as pd.

# Instructions
# 1. Sort homelessness by the number of homeless individuals in the individuals column, from smallest to largest, and save this as homelessness_ind.
# Print the head of the sorted DataFrame.


# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values(['individuals'])

# Print the top few rows
print(homelessness_ind.head())



#                 region         state  individuals  family_members  state_pop
# 50            Mountain       Wyoming        434.0           205.0     577601
# 34  West North Central  North Dakota        467.0            75.0     758080
# 7       South Atlantic      Delaware        708.0           374.0     965479
# 39         New England  Rhode Island        747.0           354.0    1058287
# 45         New England       Vermont        780.0           511.0     624358



# 2. Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values('family_members', ascending=False)

print(homelessness_fam.head())

#                 region          state  individuals  family_members  state_pop
# 32        Mid-Atlantic       New York      39827.0         52070.0   19530351
# 4              Pacific     California     109008.0         20964.0   39461588
# 21         New England  Massachusetts       6811.0         13257.0    6882635
# 9       South Atlantic        Florida      21443.0          9587.0   21244317
# 43  West South Central          Texas      19199.0          6111.0   28628666



# 3. Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())


#                 region      state  individuals  family_members  state_pop
# 13  East North Central   Illinois       6752.0          3891.0   12723071
# 35  East North Central       Ohio       6929.0          3320.0   11676341
# 22  East North Central   Michigan       5209.0          3142.0    9984072
# 49  East North Central  Wisconsin       2740.0          2167.0    5807406
# 14  East North Central    Indiana       3776.0          1482.0    6695497

In [None]:
# Subsetting columns
# When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. To select only "col_a" of the DataFrame df, use

# df["col_a"]
# To select "col_a" and "col_b" of df, use

# df[["col_a", "col_b"]]
# homelessness is available and pandas is loaded as pd.

# Instructions
# 1. Create a Series called individuals that contains only the individuals column of homelessness.

# Select the individuals column
individuals = homelessness['individuals']

print(individuals.head())

# 0      2570.0
# 1      1434.0
# 2      7259.0
# 3      2280.0
# 4    109008.0
# Name: individuals, dtype: float64


# 2. Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
# Select the state and family_members columns
state_fam = homelessness[["state", "family_members"]]

print(state_fam.head())

#         state  family_members
# 0     Alabama           864.0
# 1      Alaska           582.0
# 2     Arizona          2606.0
# 3    Arkansas           432.0
# 4  California         20964.0

# 3. Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.
# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]

print(ind_state.head())

#    individuals       state
# 0       2570.0     Alabama
# 1       1434.0      Alaska
# 2       7259.0     Arizona
# 3       2280.0    Arkansas
# 4     109008.0  California

In [None]:
# Subsetting rows
# A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

# There are many ways to subset a DataFrame; perhaps the most common is to use relational operators to return True or False for each row, then pass that inside square brackets.

# dogs[dogs["height_cm"] > 60]
# dogs[dogs["color"] == "tan"]
# You can filter for multiple conditions at once by using the "bitwise and" operator, &.

# dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
# homelessness is available and pandas is loaded as pd.


# Instructions
# 1. Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# See the result
print(ind_gt_10k)

#                 region       state  individuals  family_members  state_pop
# 4              Pacific  California     109008.0         20964.0   39461588
# 9       South Atlantic     Florida      21443.0          9587.0   21244317
# 32        Mid-Atlantic    New York      39827.0         52070.0   19530351
# 37             Pacific      Oregon      11139.0          3337.0    4181886
# 43  West South Central       Texas      19199.0          6111.0   28628666
# 47             Pacific  Washington      16424.0          5880.0    7523869


# 2. Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness["region"] == "Mountain"]

# See the result
print(mountain_reg)

#       region       state  individuals  family_members  state_pop
# 2   Mountain     Arizona       7259.0          2606.0    7158024
# 5   Mountain    Colorado       7607.0          3250.0    5691287
# 12  Mountain       Idaho       1297.0           715.0    1750536
# 26  Mountain     Montana        983.0           422.0    1060665
# 28  Mountain      Nevada       7058.0           486.0    3027341
# 31  Mountain  New Mexico       1949.0           602.0    2092741
# 44  Mountain        Utah       1904.0           972.0    3153550
# 50  Mountain     Wyoming        434.0           205.0     577601

# 3. Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.

# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

# See the result
print(fam_lt_1k_pac)

#     region   state  individuals  family_members  state_pop
# 1  Pacific  Alaska       1434.0           582.0     735139


In [None]:
# Subsetting rows by categorical variables
# Subsetting data based on a categorical variable often involves using the or operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

# colors = ["brown", "black", "tan"]
# condition = dogs["color"].isin(colors)
# dogs[condition]
# homelessness is available and pandas is loaded as pd.

# Instructions
# Filter homelessness for cases where the USA census state is in the list of Mojave states, canu, assigning to mojave_homelessness. View the printed result.

# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result
print(mojave_homelessness)


#       region       state  individuals  family_members  state_pop
# 2   Mountain     Arizona       7259.0          2606.0    7158024
# 4    Pacific  California     109008.0         20964.0   39461588
# 28  Mountain      Nevada       7058.0           486.0    3027341
# 44  Mountain        Utah       1904.0           972.0    3153550



# 🐼 Data Manipulation with Pandas: Adding New Columns

## 📌 Overview
Sometimes, your dataset doesn't have all the information you need. You can **add new columns** to a DataFrame using existing data. This process is also called:
- Mutating a DataFrame
- Transforming a DataFrame
- Feature Engineering

---

## 🧱 Adding a New Column

To create a new column:
```python
dogs["height_m"] = dogs["height_cm"] / 100
print(dogs)
```

This creates a new column `height_m` by converting height from centimeters to meters.

### ✅ Output:

| name    | breed       | color | height\_cm | weight\_kg | date\_of\_birth | height\_m |
| ------- | ----------- | ----- | ---------- | ---------- | --------------- | --------- |
| Bella   | Labrador    | Brown | 56         | 24         | 2013-07-01      | 0.56      |
| Charlie | Poodle      | Black | 43         | 24         | 2016-09-16      | 0.43      |
| Lucy    | Chow Chow   | Brown | 46         | 24         | 2014-08-25      | 0.46      |
| Cooper  | Schnauzer   | Gray  | 49         | 17         | 2011-12-11      | 0.49      |
| Max     | Labrador    | Black | 59         | 29         | 2017-01-20      | 0.59      |
| Stella  | Chihuahua   | Tan   | 18         | 2          | 2015-04-20      | 0.18      |
| Bernie  | St. Bernard | White | 77         | 74         | 2018-02-27      | 0.77      |

---

## 🧮 Doggy Mass Index (BMI)

To calculate BMI (Body Mass Index):

**Formula:**

$$
\text{BMI} = \frac{\text{Weight in kg}}{(\text{Height in m})^2}
$$

```python
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
print(dogs.head())
```

### ✅ Output (selected columns):

| name    | breed     | height\_cm | weight\_kg | height\_m | bmi    |
| ------- | --------- | ---------- | ---------- | --------- | ------ |
| Bella   | Labrador  | 56         | 24         | 0.56      | 76.53  |
| Charlie | Poodle    | 43         | 24         | 0.43      | 129.80 |
| Lucy    | Chow Chow | 46         | 24         | 0.46      | 113.42 |
| Cooper  | Schnauzer | 49         | 17         | 0.49      | 70.80  |
| Max     | Labrador  | 59         | 29         | 0.59      | 83.31  |

---

## 🛠️ Multiple Manipulations

Now let’s combine steps to filter, sort, and subset columns.

### ✅ Step-by-step:

1. **Filter** dogs with BMI less than 100.
2. **Sort** them in descending order of height.
3. **Subset** columns to just `name`, `height_cm`, and `bmi`.

```python
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)
bmi_lt_100_height[["name", "height_cm", "bmi"]]
```

### ✅ Output:

| name   | height\_cm | bmi   |
| ------ | ---------- | ----- |
| Max    | 59         | 83.31 |
| Bella  | 56         | 76.53 |
| Cooper | 49         | 70.80 |
| Stella | 18         | 61.73 |

---

## 🧠 Summary

* You can **mutate** DataFrames by adding new columns using existing ones.
* Combine operations like **subsetting, sorting, and transforming** to extract meaningful insights.
* Use math operations directly on DataFrame columns for **feature engineering** (e.g., converting units, calculating BMI).
* Pandas lets you **chain multiple steps** together for efficient data analysis.
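
The step-by-step example above stores each intermediate result in its own variable. As a sketch of what chaining can look like, the version below does the same filter, sort, and column selection in one expression. It relies on `.assign()` and `.query()`, two pandas methods these notes do not otherwise cover, and on a hypothetical four-row cut of the dogs table (`dogs_small`).

```python
import pandas as pd

dogs_small = pd.DataFrame({
    "name": ["Bella", "Charlie", "Cooper", "Stella"],
    "height_cm": [56, 43, 49, 18],
    "weight_kg": [24, 24, 17, 2],
})

result = (
    dogs_small
    # add the bmi column: weight in kg divided by height in metres squared
    .assign(bmi=lambda d: d["weight_kg"] / (d["height_cm"] / 100) ** 2)
    # keep rows with bmi below 100
    .query("bmi < 100")
    # tallest first
    .sort_values("height_cm", ascending=False)
    # keep only the columns of interest
    [["name", "height_cm", "bmi"]]
)
print(result)
```

Intermediate variables are easier to inspect while exploring; a chain is more compact once the steps are settled.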

---



In [None]:
# Adding new columns
# You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

# You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.

# homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals not part of a family with children. The family_members column is the number of homeless individuals part of a family with children. The state_pop column is the state's total population.

# homelessness is available and pandas is loaded as pd.

# Instructions
# Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.
# Add another column to homelessness, named p_homeless, containing the proportion of the total homeless population to the total population in each state (state_pop).

# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_homeless col as proportion of total homeless population to the state population
homelessness['p_homeless'] = homelessness['total'] / homelessness['state_pop']

# See the result
print(homelessness.head())


#                region       state  individuals  family_members  state_pop     total  p_homeless
# 0  East South Central     Alabama       2570.0           864.0    4887681    3434.0    0.000703
# 1             Pacific      Alaska       1434.0           582.0     735139    2016.0    0.002742
# 2            Mountain     Arizona       7259.0          2606.0    7158024    9865.0    0.001378
# 3  West South Central    Arkansas       2280.0           432.0    3009733    2712.0    0.000901
# 4             Pacific  California     109008.0         20964.0   39461588  129972.0    0.003294

In [None]:
# Combo-attack!
# You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

# In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new pandas skills to find out.

# Instructions
# Add a column to homelessness, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state, using state_pop for state population.
# Subset rows where indiv_per_10k is higher than 20, assigning to high_homelessness.
# Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.
# Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result. Look at the result.


# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness['state_pop'] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values(['indiv_per_10k'], ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state', 'indiv_per_10k']]

# See the result
print(result)


# Output:
#                     state  indiv_per_10k
# 8   District of Columbia      53.738381
# 11                Hawaii      29.079406
# 4             California      27.623825
# 37                Oregon      26.636307
# 28                Nevada      23.314189
# 47            Washington      21.829195
# 32              New York      20.392363