# Task 1: Load & Inspect

In [1]:
import pandas as pd

In [2]:
# Load your CSV into a DataFrame called 'df'
# Print: shape, column names, data types
# Display first 10 rows

In [3]:
df = pd.read_csv('../../austin-investment-analyzer/data/processed/final_neighborhoods.csv')

In [4]:
print('\n' + '='*50)
print('Data Shape: ')
print(df.shape)

print('\n' + '='*50)
print("Column Names: ")
print(df.columns)

print('\n' + '='*50)
print("Data Types: ")
print(df.dtypes)

print('\n' + '='*50)
print("First 10 Rows: ")
print(df.head(10))


Data Shape: 
(56, 337)

Column Names: 
Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
       'State', 'City', 'Metro', 'CountyName', '2000-01-31',
       ...
       'median_monthly_str_income', 'median_bedrooms', 'occupancy_rate',
       'listing_count', 'estimated_ltr_monthly_rent',
       'str_rent_to_price_ratio', 'ltr_rent_to_price_ratio', 'monthly_costs',
       'str_monthly_cash_flow', 'ltr_monthly_cash_flow'],
      dtype='object', length=337)

Data Types: 
RegionID                     int64
SizeRank                     int64
RegionName                  object
RegionType                  object
StateName                   object
                            ...   
str_rent_to_price_ratio    float64
ltr_rent_to_price_ratio    float64
monthly_costs              float64
str_monthly_cash_flow      float64
ltr_monthly_cash_flow      float64
Length: 337, dtype: object

First 10 Rows: 
   RegionID  SizeRank       RegionName    RegionType StateName State    City

# Task 2: Filter Rows - Single Condition 

In [5]:
# Create new DataFrame with rows where [numeric column] > [some value]
# Print how many rows match
# Display first 5 rows of filtered data

In [6]:
condition_df = df[df['str_monthly_cash_flow'] > 200]
print(condition_df)

print('\n' + '='*50)
print("First 5 Rows of Filtered Data: ")
print(condition_df.head(5))

    RegionID  SizeRank         RegionName    RegionType StateName State  \
4     271353      1435           Downtown  neighborhood        TX    TX   
7     276473      1900       North Burnet  neighborhood        TX    TX   
13    271398      2375     Georgian Acres  neighborhood        TX    TX   
17    271497      3054        North Lamar  neighborhood        TX    TX   
29    271616      4342  Upper Boggy Creek  neighborhood        TX    TX   
44    273354      7263     Coronado Hills  neighborhood        TX    TX   
46    274045      8301              Holly  neighborhood        TX    TX   

      City                             Metro     CountyName     2000-01-31  \
4   Austin  Austin-Round Rock-Georgetown, TX  Travis County            NaN   
7   Austin  Austin-Round Rock-Georgetown, TX  Travis County            NaN   
13  Austin  Austin-Round Rock-Georgetown, TX  Travis County   93127.205275   
17  Austin  Austin-Round Rock-Georgetown, TX  Travis County   88922.739298   
29  Austi
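
The task also asks how many rows match. A minimal sketch of one way to get that count, using a small made-up frame rather than the real CSV (the four rows are rounded from values shown elsewhere in this notebook; `toy`, `mask`, and `match_count` are illustrative names):

```python
import pandas as pd

# Toy stand-in for df; values rounded from the notebook's own output.
toy = pd.DataFrame({
    "RegionName": ["Downtown", "Franklin Park", "Holly", "Allandale"],
    "str_monthly_cash_flow": [971.99, -0.67, 442.01, -2637.20],
})

# Boolean mask: True on rows where the condition holds.
mask = toy["str_monthly_cash_flow"] > 200

# Summing a boolean Series counts its True values.
match_count = int(mask.sum())

filtered = toy[mask]
print(f"Rows matching: {match_count}")
print(filtered.head(5))
```

`len(filtered)` gives the same number; summing the mask simply avoids building the filtered frame when only the count is needed.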

# Task 3: Filter Rows - Multiple Conditions

In [7]:
# Create new DataFrame with rows where:
#   - [numeric column 1] > [value] AND
#   - [numeric column 2] < [value]
# Print shape of result

In [8]:
task3 = df[(df['str_monthly_cash_flow'] > 500) & (df['monthly_costs'] < 2000)]
print(task3)

print('\n' + '='*50)
print("Task 3 Shape: ")
print(task3.shape)

Empty DataFrame
Columns: [RegionID, SizeRank, RegionName, RegionType, StateName, State, City, Metro, CountyName, 2000-01-31, 2000-02-29, 2000-03-31, 2000-04-30, 2000-05-31, 2000-06-30, 2000-07-31, 2000-08-31, 2000-09-30, 2000-10-31, 2000-11-30, 2000-12-31, 2001-01-31, 2001-02-28, 2001-03-31, 2001-04-30, 2001-05-31, 2001-06-30, 2001-07-31, 2001-08-31, 2001-09-30, 2001-10-31, 2001-11-30, 2001-12-31, 2002-01-31, 2002-02-28, 2002-03-31, 2002-04-30, 2002-05-31, 2002-06-30, 2002-07-31, 2002-08-31, 2002-09-30, 2002-10-31, 2002-11-30, 2002-12-31, 2003-01-31, 2003-02-28, 2003-03-31, 2003-04-30, 2003-05-31, 2003-06-30, 2003-07-31, 2003-08-31, 2003-09-30, 2003-10-31, 2003-11-30, 2003-12-31, 2004-01-31, 2004-02-29, 2004-03-31, 2004-04-30, 2004-05-31, 2004-06-30, 2004-07-31, 2004-08-31, 2004-09-30, 2004-10-31, 2004-11-30, 2004-12-31, 2005-01-31, 2005-02-28, 2005-03-31, 2005-04-30, 2005-05-31, 2005-06-30, 2005-07-31, 2005-08-31, 2005-09-30, 2005-10-31, 2005-11-30, 2005-12-31, 2006-01-31, 2006-02-28,
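
An empty result like this usually means one condition, or the combination of both, matches no rows. One way to find out which is to count each condition separately. The sketch below uses hypothetical values, not the real data, so it says nothing about which condition is the strict one in the actual CSV:

```python
import pandas as pd

# Hypothetical values chosen only to reproduce an empty intersection.
toy = pd.DataFrame({
    "str_monthly_cash_flow": [971.99, 5070.50, -1780.00, 712.70],
    "monthly_costs": [4946.41, 3100.00, 4480.00, 2500.00],
})

high_cash_flow = toy["str_monthly_cash_flow"] > 500
low_costs = toy["monthly_costs"] < 2000

# Count matches for each condition alone, then for both together.
counts = {
    "cash_flow > 500": int(high_cash_flow.sum()),
    "costs < 2000": int(low_costs.sum()),
    "both": int((high_cash_flow & low_costs).sum()),
}
print(counts)
```

If one of the single-condition counts is already zero, that threshold is the one to revisit.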

# Task 4: Select Columns

In [9]:
# Create new DataFrame with only 3 columns of your choice
# Display first 5 rows

In [10]:
task4 = df[['monthly_costs', 'str_monthly_cash_flow', 'ltr_monthly_cash_flow']]
print(task4.head(5))

   monthly_costs  str_monthly_cash_flow  ltr_monthly_cash_flow
0    4480.002356           -1780.002356           -3400.002356
1    3412.506605           -1392.906605           -2604.666605
2    2016.469971           -1771.669971           -1918.549971
3    2031.072669              -0.672669           -1218.912669
4    4946.407613             971.992387           -2579.047613


# Task 5: Create New Column 

In [11]:
# Create a new column that is calculated from existing columns
# Example: 'price_per_bedroom' = price / bedrooms
# Or: 'total_rooms' = bedrooms + bathrooms
# Display first 5 rows showing new column

In [12]:
df['str_vs_ltr_cashflow_delta'] = df['str_monthly_cash_flow'] - df['ltr_monthly_cash_flow']
df['str_vs_ltr_cashflow_delta'].head(5)

0    1620.00
1    1211.76
2     146.88
3    1218.24
4    3551.04
Name: str_vs_ltr_cashflow_delta, dtype: float64

# Task 6: Basic Aggregations

In [13]:
# Calculate and print:
# - Mean of a numeric column
# - Median of a numeric column
# - Standard deviation of a numeric column
# - Min and max values

In [14]:
print("New Column Mean: ")
print(df['str_vs_ltr_cashflow_delta'].mean())

print('\n' + '='*50)
print("New Column Median: ")
print(df['str_vs_ltr_cashflow_delta'].median())

print('\n' + '='*50)
print("New Column Standard Deviation: ")
print(df['str_vs_ltr_cashflow_delta'].std())

print('\n' + '='*50)
print("New Column Min: ")
print(df['str_vs_ltr_cashflow_delta'].min())

print('\n' + '='*50)
print("New Column Max: ")
print(df['str_vs_ltr_cashflow_delta'].max())

New Column Mean: 
1386.9883928571428

New Column Median: 
1290.6

New Column Standard Deviation: 
854.6805889002187

New Column Min: 
86.94000000000005

New Column Max: 
4443.12
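
The same five statistics can also be requested in one call with `Series.agg`. A small sketch using only the first five delta values shown in the Task 5 output, so the numbers will differ from the full-column results above:

```python
import pandas as pd

# First five values of str_vs_ltr_cashflow_delta, copied from the Task 5 output.
delta = pd.Series(
    [1620.00, 1211.76, 146.88, 1218.24, 3551.04],
    name="str_vs_ltr_cashflow_delta",
)

# One call returns a Series indexed by statistic name.
summary = delta.agg(["mean", "median", "std", "min", "max"])
print(summary)
```

`delta.describe()` is a related shortcut that also reports the count and quartiles.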


# Task 7: Group By and Aggregate

In [15]:
# Group by a categorical column (neighborhood, room_type, etc.)
# Calculate mean of a numeric column for each group
# Display results

In [16]:
print(df.groupby('RegionName')['str_monthly_cash_flow'].mean())

RegionName
Allandale              -2637.204690
Barton Creek          -15841.967681
Barton Hills           -4502.526495
Bouldin Creek          -2746.557413
Brentwood              -2737.720143
Brushy Bend Park       -5684.270203
Bryker Woods           -8546.335465
Central East Austin     -253.399564
Chestnut               -1488.394228
Coronado Hills          5070.504484
Crestview              -2458.084771
Dawson                 -1257.599484
Downtown                 971.992387
East Cesar Chavez       -852.617060
East Congress          -2706.786472
Fern Bluff             -2182.624664
Franklin Park             -0.672669
Galindo                -3053.969670
Georgian Acres           712.700414
Govalle                 -757.535082
Hancock                -2138.110931
Highland                -926.169760
Holly                    442.010824
Hyde Park               -920.901285
Johnston Terrace        -479.356840
Kensington Place       -1180.449622
MLK                     -660.464401
Mckinney         

# Task 8: Sort and Get Top N

In [17]:
# Sort DataFrame by a column (descending)
# Get top 10 rows
# Display them

In [18]:
print(df.sort_values('str_monthly_cash_flow', ascending=False).head(10))

    RegionID  SizeRank         RegionName    RegionType StateName State  \
44    273354      7263     Coronado Hills  neighborhood        TX    TX   
4     271353      1435           Downtown  neighborhood        TX    TX   
17    271497      3054        North Lamar  neighborhood        TX    TX   
13    271398      2375     Georgian Acres  neighborhood        TX    TX   
29    271616      4342  Upper Boggy Creek  neighborhood        TX    TX   
7     276473      1900       North Burnet  neighborhood        TX    TX   
46    274045      8301              Holly  neighborhood        TX    TX   
12    271529      2318        Parker Lane  neighborhood        TX    TX   
3     271391      1205      Franklin Park  neighborhood        TX    TX   
16    271665      3001             Wooten  neighborhood        TX    TX   

      City                             Metro     CountyName     2000-01-31  \
44  Austin  Austin-Round Rock-Georgetown, TX  Travis County  136996.284701   
4   Austin  Austin
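
With several hundred columns, the printed top 10 wraps across many blocks. Selecting a couple of columns after sorting keeps the ranking readable. A sketch on a toy frame (values rounded from the Task 7 output; `cols` is an illustrative name):

```python
import pandas as pd

# Toy frame; cash-flow values rounded from the notebook's Task 7 output.
toy = pd.DataFrame({
    "RegionName": ["Downtown", "Franklin Park", "Georgian Acres",
                   "Holly", "Coronado Hills"],
    "str_monthly_cash_flow": [971.99, -0.67, 712.70, 442.01, 5070.50],
})

cols = ["RegionName", "str_monthly_cash_flow"]

# Sort descending, keep only the columns of interest, take the top 3.
top3_sorted = toy.sort_values("str_monthly_cash_flow", ascending=False)[cols].head(3)

# nlargest expresses the same "top N by column" request directly.
top3_nlargest = toy.nlargest(3, "str_monthly_cash_flow")[cols]

print(top3_sorted)
```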

# Task 9: Handle Missing Values

In [19]:
# Check for missing values in your DataFrame
# Print count of missing values per column
# Drop rows with any missing values OR fill them (your choice)
# Print new shape

In [34]:
print('\n' + '='*50)
print("Count of Missing Values Per Column: ")
print(df.isna().sum())

# Fill every missing value with 0 (chosen here instead of dropping rows)
df = df.fillna(0)

print('\n' + '='*50)
print("New Shape: ")
print(df.shape)


Count of Missing Values Per Column: 
RegionID                     0
SizeRank                     0
RegionName                   0
RegionType                   0
StateName                    0
                            ..
ltr_rent_to_price_ratio      0
monthly_costs                0
str_monthly_cash_flow        0
ltr_monthly_cash_flow        0
str_vs_ltr_cashflow_delta    0
Length: 338, dtype: int64

New Shape: 
(56, 338)
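
Because pandas truncates long printed Series, most of the 338 per-column counts are hidden behind the `..` row. Filtering the counts down to columns that actually contain missing values is one way around that. A sketch on made-up data (only the column names are borrowed from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in one date column; values are hypothetical.
toy = pd.DataFrame({
    "RegionName": ["A", "B", "C"],
    "2000-01-31": [np.nan, np.nan, 93127.21],
    "monthly_costs": [4480.00, 3412.51, 2016.47],
})

# Missing-value count per column.
missing = toy.isna().sum()

# Keep only the columns that have at least one missing value.
missing_only = missing[missing > 0]
print(missing_only)
```

Running this kind of check before any fill step matters, since the counts are all zero once the gaps have been filled.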


# Task 10: Save Processed Data

In [35]:
# Save your filtered/processed DataFrame to a new CSV
# Filename: 'processed_data.csv'

In [36]:
df.to_csv('processed_data.csv', index=False)

# Part 2: Conceptual Check

**Question 1: Data Structures**
Explain the difference between a Python list and a pandas DataFrame. When would you use each?

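*A Python list is a one-dimensional, ordered sequence of values identified only by position, whereas a pandas DataFrame is a two-dimensional table with labeled rows and columns, where each column has its own data type. You would use a list for a simple sequence of items. You would use a DataFrame for tabular data where several related fields need to be filtered, combined, or aggregated together.*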
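
A short side-by-side sketch of the two structures, using three cash-flow values rounded from this notebook's output (variable names are illustrative):

```python
import pandas as pd

# A plain list: values are reachable by position only.
cash_flows = [971.99, -0.67, 442.01]
list_mean = sum(cash_flows) / len(cash_flows)

# A DataFrame: labeled columns, so related fields travel together.
toy = pd.DataFrame({
    "RegionName": ["Downtown", "Franklin Park", "Holly"],
    "str_monthly_cash_flow": cash_flows,
})
frame_mean = toy["str_monthly_cash_flow"].mean()

# Filter on one column and read the answer from another.
positive = toy.loc[toy["str_monthly_cash_flow"] > 0, "RegionName"].tolist()
print(positive)
```

The mean is the same either way; the DataFrame earns its keep once a question involves more than one field, such as "which neighborhoods have positive cash flow?"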

**Question 2: Filtering Logic**
What does this code do: `df[(df['price'] > 100) & (df['bedrooms'] >= 2)]`
Explain the `&` symbol and why conditions need parentheses.

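*It creates a new DataFrame containing only the rows that meet both conditions: price greater than 100 and bedrooms greater than or equal to 2. The `&` symbol is the element-wise AND operator: it compares the two True/False columns row by row, so a row is kept only when both are True (Python's plain `and` does not work on whole columns). The parentheses are needed because `&` binds more tightly than comparison operators such as `>` and `>=`; without them, Python would try to evaluate `100 & df['bedrooms']` first.*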
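
A small sketch of why the parentheses matter. In Python, `&` has higher precedence than `>` and `>=`, so the unparenthesized version is parsed differently and fails. The toy values below are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"price": [90, 150, 200], "bedrooms": [3, 1, 2]})

# Correct: each comparison is evaluated first, then combined row by row.
both = toy[(toy["price"] > 100) & (toy["bedrooms"] >= 2)]

# Without parentheses Python evaluates `100 & toy["bedrooms"]` first, which
# turns the expression into a chained comparison that needs a single
# True/False from a whole Series, so pandas raises ValueError.
try:
    toy[toy["price"] > 100 & toy["bedrooms"] >= 2]
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__

print(both)
print(error_name)
```

Using `and` in place of `&` fails for the same underlying reason: it asks for one truth value from an entire column.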

**Question 3: Groupby Operation**
Explain to a non-technical person what `df.groupby('neighborhood').mean()` does.
Use an analogy (like organizing books by genre, calculating average price per store, etc.)

*This method splits the rows into groups that share the same neighborhood, then calculates the average of each numeric column within each group, giving one row of averages per neighborhood. It's like sorting a pile of books by genre and then working out the average price for each genre's stack.*
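
A toy illustration of what `groupby` followed by `mean` does: rows are split by the grouping column, and the average is computed once per group. The neighborhoods and prices below are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "neighborhood": ["East", "East", "West", "West", "West"],
    "price": [100, 200, 300, 400, 500],
})

# One average per neighborhood: East averages 2 rows, West averages 3.
avg_price = toy.groupby("neighborhood")["price"].mean()
print(avg_price)
```

Selecting the `price` column before calling `.mean()` keeps the result to the one numeric column of interest.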

**Question 4: Statistics Basics**
Define in plain English:

- Mean
- Median
- Standard deviation

When would median be more useful than mean?

*Mean is the average: the sum of all values divided by how many there are. Median is the middle value when the values are sorted from lowest to highest, so half fall below it and half above. Standard deviation measures how spread out the values are around the mean: a small value means most numbers sit close to the average, and a large value means they are widely scattered (I'm still not fully confident on this one). Median is more useful than mean when there are extreme outliers that significantly shift the mean without representing the bulk of the data. This happens a lot in housing data, where a few expensive homes skew the picture of what most homes cost.*
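
A quick numeric sketch of the outlier effect, using five invented home prices where one is far above the rest:

```python
import pandas as pd

# Four similar homes and one very expensive one (hypothetical prices).
prices = pd.Series([200_000, 210_000, 220_000, 230_000, 2_000_000])

mean_price = prices.mean()      # pulled upward by the single outlier
median_price = prices.median()  # middle value of the sorted prices

# Standard deviation with and without the outlier, to show the spread change.
std_with_outlier = prices.std()
std_without_outlier = prices.iloc[:4].std()

print(mean_price, median_price)
print(std_with_outlier, std_without_outlier)
```

Here the mean is 572,000 while the median is 220,000, so the median sits much closer to what a typical home in the sample costs.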

**Question 5: Missing Data**
You have a dataset with missing values. Describe two strategies for handling them and when you'd use each.

*For missing data, you could leave the gaps as NaN, or you could fill them, for example with a forward or back fill. Leaving them as NaN makes sense when the gaps need to be preserved for accuracy at later steps in the project. For most accounting and financial analysis, where calculations need a value in every cell, you may prefer to fill them.*
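
A minimal sketch comparing two common options, dropping rows versus forward filling, on a made-up monthly series (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "month": ["2000-01", "2000-02", "2000-03", "2000-04"],
    "home_value": [np.nan, 100.0, np.nan, 104.0],
})

# Option 1: drop rows where the value is missing (loses two of four months).
dropped = toy.dropna(subset=["home_value"])

# Option 2: carry the last known value forward. A gap at the very start
# has nothing before it, so it stays NaN.
forward = toy["home_value"].ffill()

print(dropped)
print(forward)
```

Dropping keeps only observed values but shrinks the data; forward filling keeps every row but assumes the value did not change during the gap.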

## Part 3: Gap Analysis (15-30 min)

**Review your results honestly. Answer these questions:**

### What's Automatic Now?

**List Tier 1 skills that felt easy (no lookup needed):**

- Example: "Loading CSV, displaying head, getting shape"
- Example: "Filtering with single condition"
- Everything in the initial setup felt natural, from `pd.read_csv(path)` through the quick commands that give me a clear mental picture of the data's size, shape, and column names, as well as the dtype of each column.

---

### What Still Requires Lookup?

**List skills where you needed docs or got stuck:**

- Example: "Groupby syntax - kept forgetting parentheses"
- Example: "Multiple condition filtering - & vs 'and'"
- I needed the pandas docs to find missing values and to pick an option for dealing with them. I chose .ffill(). I also needed confirmation that `to_csv()` was the correct way to save the new DataFrame. That is only because I hadn't done those manually before; they will click quickly as they become routine.

---

### What Conceptual Gaps Remain?

**Topics that feel fuzzy or incomplete:**

- Example: "Don't fully understand when to use groupby vs filter"
- Example: "Missing data strategies - not sure when to drop vs fill"
- I felt fuzzy on both of those examples, though that may be because I was following instructions about what to find, filter, and sort by. As the tasks progressed, however, I began to feel a greater sense of agency over the data. That makes it easier to visualize the data and picture what I want to get from it, which in turn makes it easier to connect my mental image of how the rows and columns should move with the pandas commands that make those changes.