# HW3B - Pandas Fundamentals

See Canvas for details on how to complete and submit this assignment.

## Introduction

This assignment transitions you from NumPy's numerical array operations to Pandas' powerful tabular data manipulation. While NumPy excels at homogeneous numerical arrays, Pandas is designed for the heterogeneous, labeled data that characterizes most real-world datasets—mixing dates, categories, numbers, and text within the same table.

You'll work with real bike share data from Chicago's Divvy system to answer questions about urban transportation patterns. Through three progressively complex problems—exploring usage patterns, analyzing rider behavior, and conducting temporal analysis—you'll discover why Pandas has become the standard tool for data analysis in Python.

The assignment emphasizes Pandas' design philosophy: named column access, explicit indexing methods (loc/iloc), handling missing data, and method chaining for readable data pipelines. You'll also see how Pandas builds on NumPy while adding the structure and convenience needed for practical data science work.

This assignment should take 3-5 hours to complete.

Before submitting, ensure your notebook:

- Runs completely with "Kernel → Restart & Run All"
- Includes thoughtful responses to all interpretation questions
- Uses clear variable names and follows good coding practices
- Shows your work (don't just print final answers)

### Learning Objectives

By completing this assignment, you will be able to:

1. **Construct and manipulate Pandas data structures**
   - Create DataFrames from dictionaries and CSV files
   - Distinguish between Series and DataFrame objects
   - Set and reset index structures appropriately
   - Understand when operations return views vs copies
2. **Apply explicit indexing paradigms**
   - Use `loc[]` for label-based data access
   - Use `iloc[]` for position-based data access
   - Access columns using bracket notation
   - Explain when each indexing method is appropriate
3. **Diagnose and explore datasets systematically**
   - Use `info()`, `describe()`, `head()`, and `dtypes` to understand data structure
   - Identify missing values with `isna()` and `notna()`
   - Calculate summary statistics across different axes
   - Interpret value distributions with `value_counts()`
4. **Filter data with boolean indexing and queries**
   - Combine multiple conditions with `&`, `|`, and `~` operators
   - Use `isin()` for membership testing
   - Apply `query()` for readable complex filters
   - Understand how index alignment affects operations
5. **Work with datetime data**
   - Parse dates during CSV loading
   - Extract temporal components with the `.dt` accessor
   - Filter data by date ranges
   - Create time-based derived features
6. **Connect Pandas patterns to data analysis workflows**
   - Formulate questions that data can answer
   - Choose appropriate methods for different analysis tasks
   - Interpret results in domain context
   - Recognize when vectorized operations outperform apply()

### Generative AI Allowance

You may use GenAI tools for brainstorming, explanations, and code sketches if you disclose it, understand it, and validate it. Your submission must represent your own work and you are solely responsible for its correctness.

### Scoring

Total of 90 points available, will be graded out of 80. Scores of >100% are allowed.

Distribution:

- Tasks: 48 pts
- Interpretation: 32 pts
- Reflection: 10 pts

Points by Problem:

- Problem 1: 3 tasks, 10 pts
- Problem 2: 4 tasks, 14 pts
- Problem 3: 4 tasks, 14 pts
- Problem 4: 3 tasks, 10 pts

Interpretation Questions:

- Problem 1: 3 questions, 8 pts
- Problem 2: 4 questions, 8 pts
- Problem 3: 3 questions, 8 pts
- Problem 4: 3 questions, 8 pts

Graduate differentiation: poor follow-up responses will result in up to a 5pt deduction for that problem.

## Dataset: Chicago Divvy Bike Share

The dataset you will analyze is based on real trip information from Divvy, Chicago's bike share system. It contains individual trips with start/end times, station information, and rider type.

Dataset homepage: https://divvybikes.com/system-data

Each trip includes:

- Trip start and end times (datetime)
- Start and end station names and IDs
- Rider type (member vs casual)
- Bike type (classic, electric, or docked)

Chicago's Department of Transportation uses this data to optimize station placement, understand usage patterns, and improve service. You'll explore similar questions that real transportation analysts investigate.

## Problems

### Problem 1: Creating DataFrames from Scratch

Before loading data from files, you need to understand how Pandas structures are built. In this problem, you'll create Series and DataFrames manually using Python's built-in data structures. This is a quick warmup to establish the fundamentals.

#### Task 1a: Create a Series

Create a Series called `temperatures` representing daily high temperatures for a week:

- Monday: 72°F
- Tuesday: 75°F  
- Wednesday: 68°F
- Thursday: 71°F
- Friday: 73°F

Use the day names as the index. Print the Series and its data type.

##### Your Code

In [4]:
import pandas as pd

# Task 1a code here...
# daily_hight_temps = {"Monday": 72, "Tuesday": 75, "Wednesday": 68, "Thursday": 71, "Friday": 73}
temperatures = pd.Series([72, 75, 68, 71, 73], 
                             index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"])

#temperatures = pd.Series(daily_hight_temps) 
#print(temperatures)
temperatures

Monday       72
Tuesday      75
Wednesday    68
Thursday     71
Friday       73
dtype: int64

#### Task 1b: Create a DataFrame from a Dictionary

Create a DataFrame called `products` with the following data:

| product | price | quantity |
|---------|-------|----------|
| Widget  | 19.99 | 100 |
| Gadget  | 24.99 | 75 |
| Doohickey | 12.49 | 150 |

Use a dictionary where keys are column names and values are lists. Print the DataFrame and report its shape.

##### Your Code

In [5]:
# Task 1b code here...
products = pd.DataFrame({
         'product': ['Widget', 'Gadget', 'Doohickey'],
         'price': [19.99, 24.99, 12.49],
         'quantity': [100, 75, 150]
})
print("Products: ")
print(products)
products.shape

Products: 
     product  price  quantity
0     Widget  19.99       100
1     Gadget  24.99        75
2  Doohickey  12.49       150


(3, 3)

#### Task 1c: Access DataFrame Elements

Using the `products` DataFrame from Task 1b, extract and print:

1. The `price` column as a Series
2. The `product` and `quantity` columns as a DataFrame (using a list of column names)

##### Your Code

In [53]:
# Task 1c code here...
products['price']  
products[['product', 'quantity']]

Unnamed: 0,product,quantity
0,Widget,100
1,Gadget,75
2,Doohickey,150


#### Interpretation

Answer the following questions (briefly / concisely) in the markdown cell below:

1. Data structure mapping: When you create a DataFrame from a dictionary (like in Task 1b), what do the dictionary keys become? What do the values become?
2. Bracket notation: Why does `df['price']` return a Series, but `df[['price']]` return a DataFrame? What's the difference in what you're asking for?
3. Index purpose: In Task 1a, you used day names as the index instead of default numbers (0, 1, 2...). When would a custom index like this be more useful than the default numeric index?

##### Your Answers

#### Problem 1 Interpretation 

1. The dictionary _keys_ become columns and _values_ become rows.
2. `df['price']` returns a Series and `df[['price']]` returns a DataFrame due to the different syntax used, which is *single* versus *double* brackets:
    - When using *single* brackets in `df['price']`, the code is telling pandas to return the values of the columns with no **heading** for each column, but only integer values for labeling each row.
    - When using *double* brackets in `df[['price']]`, the code is telling pandas to return the values of the columns with a **heading** for each column in addition to integer values for labeling each row.
3. A custom index would be more useful possibly when a data analyst wants to index/access a row by a *name* instead of an *integer* 



### Problem 2: Loading and Initial Exploration

Before starting this problem, make sure you are working in a copy of this file in the `my_repo` folder you created in HW2a. You must also have a copy of the file `202410-divvy-tripdata-100k.csv` in a subdirectory called `data`. That file structure is illustrated below.

```text
~/insy6500/my_repo
└── homework
    ├── data
    │   └── 202410-divvy-tripdata-100k.csv
    └── hw3b.ipynb
```

#### Task 2a: Load and Understand Raw Data

Start by loading the data "as-is" to get a general understanding of the overall structure and how Pandas interprets it by default.

Note on file paths: The provided code uses `Path` from Python's `pathlib` module to handle file paths. Path objects work consistently across operating systems (Windows uses backslashes `\`, Mac/Linux use forward slashes `/`), automatically using the correct separator for your system. The provided code defines `csv_path` which should be used as the filename in your `pd.read_csv` to load the data file.

1. Use `pd.read_csv` to load `csv_path` (provided below) without specifying any other arguments. Assign it to the variable `df_raw`.
2. Use the methods we described in class to explore the shape, structure, types, etc. of the data. In particular, consider which columns represent dates or categories.
3. Note the amount of memory used by the dataset. See the section on memory diagnostics in notebook 07a for appropriate code snippets using `memory_usage`.

##### Your Code

In [6]:
#### import pandas as pd
import numpy as np
from pathlib import Path

# create a OS-independent pointer to the csv file created by Setup
csv_path = Path('./data/202410-divvy-tripdata-100k.csv')

# load and explore the data below (create additional code / markdown cells as necessary)
df_raw = pd.read_csv(csv_path)
df_raw.info # Displays the information/data of all the columns
#df_raw.dtypes
#df_raw.describe()


<bound method DataFrame.info of                 ride_id  rideable_type               started_at  \
0      67BB74BD7667BAB7  electric_bike  2024-09-30 23:12:01.622   
1      5AF1AC3BA86ED58C  electric_bike  2024-09-30 23:19:25.409   
2      7961DD2FC1280CDC   classic_bike  2024-09-30 23:32:24.672   
3      2E16892DEEF4CC19   classic_bike  2024-09-30 23:42:11.207   
4      AAF0220F819BEE01  electric_bike  2024-09-30 23:49:25.380   
...                 ...            ...                      ...   
99995  6D5AFF497514A788   classic_bike  2024-10-31 23:44:23.211   
99996  527E9D2BDCAFFEC4   classic_bike  2024-10-31 23:44:45.948   
99997  33A63439F82E7542   classic_bike  2024-10-31 23:50:31.160   
99998  2BE6AF69988C197F   classic_bike  2024-10-31 23:53:02.355   
99999  A925983EBD0E911E   classic_bike  2024-10-31 23:54:02.851   

                      ended_at         start_station_name start_station_id  \
0      2024-10-01 00:20:00.674     Oakley Ave & Touhy Ave           bdd4c3   
1      

In [168]:
#### import pandas as pd
import numpy as np
from pathlib import Path

# create a OS-independent pointer to the csv file created by Setup
csv_path = Path('./data/202410-divvy-tripdata-100k.csv')

# load and explore the data below (create additional code / markdown cells as necessary)
df_raw = pd.read_csv(csv_path)
#df_raw.info() # Displays columns in row format and specifically shows only two characteristics of the columns 
#df_raw.describe()

# Check memory usage by column
print("Memory usage by column:")
print(df_raw.memory_usage(deep=True))

print(f"\nData Types: \n{df_raw.dtypes}")

Memory usage by column:
Index                     132
ride_id               7300000
rideable_type         6950493
started_at            8000000
ended_at              8000000
start_station_name    7549653
start_station_id      5978313
end_station_name      7546619
end_station_id        5974035
start_lat              800000
start_lng              800000
end_lat                800000
end_lng                800000
member_casual         6300000
dtype: int64

Data Types: 
ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object


#### Task 2b: Reload with Proper Data Types

1. Repeat step 2a.1 to reload the data. Use the `dtype` and `parse_dates` arguments to properly assign categorical and date types. Assign the result to the variable name `rides`.
2. After loading, use `rides.info()` to confirm the type changes.
3. Use `memory_usage` to compare the resulting size with that from step 2a.3.

##### Your Code

In [9]:
# task 2b code here...
df_raw = pd.read_csv(csv_path)    

# dtypes_data = df_raw.dtypes
# df_raw.dtypes
#df_raw.info()

dtype_dict={ 'ride_id': 'object',
             'rideable_type': 'category',
             'start_station_name': 'category',
             'start_station_id': 'category',
             'end_station_name': 'category',
             'end_station_id': 'category',
             'member_casual': 'category'                
           }  
                                
rides = pd.read_csv(csv_path, 
                    dtype=dtype_dict,
                    parse_dates=['started_at', 'ended_at']) 

# print("Rides: ")
# print(rides) 
print("arrangement of data with proper data types: \n")
rides.info()  

# Check memory usage by column
print("\n Memory usage by column with raw data: ")
print(df_raw.memory_usage(deep=True))

# Check memory usage by column
print("\nMemory usage by column with proper data types:")
print(rides.memory_usage(deep=True))

print("\nTotal memory usage with raw data:")
print(f"{df_raw.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\nTotal memory usage with proper data types:")
print(f"{rides.memory_usage(deep=True).sum() / 1024**2:.2f} MB")



arrangement of data with proper data types: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   ride_id             100000 non-null  object        
 1   rideable_type       100000 non-null  category      
 2   started_at          100000 non-null  datetime64[ns]
 3   ended_at            100000 non-null  datetime64[ns]
 4   start_station_name  89623 non-null   category      
 5   start_station_id    89623 non-null   category      
 6   end_station_name    89485 non-null   category      
 7   end_station_id      89485 non-null   category      
 8   start_lat           100000 non-null  float64       
 9   start_lng           100000 non-null  float64       
 10  end_lat             99913 non-null   float64       
 11  end_lng             99913 non-null   float64       
 12  member_casual       100000 non-null  cate

#### Task 2c: Explore Structure and Missing Data

Using the `rides` DataFrame from Task 2b:

1. Determine the range of starting dates in the dataframe using the `min` and `max` methods.
2. Count the number of missing values in each column. See the section of the same name in lecture 06b.
3. Convert the Series from step 2 to a DataFrame using `.to_frame(name='count')`, then add a column called 'percentage' that calculates the percentage of missing values for each column.

##### Your Code

In [10]:
# task 2c code here...
range_startDates = rides['started_at']

#print(range_startDates) 
print(f"Start dates: \n{range_startDates}\n")
#print(range_startDates.min) 

Minimum_date = range_startDates.min()
Maximum_date = range_startDates.max()
print(f"Minumum date: \n{Minimum_date}\n")
print(f"Maximum date: \n{Maximum_date}\n")

#print(rides.isna())
Num_NaN = rides.isna().sum()
print(f"Number of missing values: \n{Num_NaN}")

Num_NaN_DateFrame = Num_NaN.to_frame(name='count')
Num_NaN_DateFrame['percentage'] = ((Num_NaN_DateFrame['count'] / len(rides)) * 100)
# Num_NaN_DateFrame['new_column'] -> standard way to add a new column in pandas 

print(Num_NaN_DateFrame)


Start dates: 
0       2024-09-30 23:12:01.622
1       2024-09-30 23:19:25.409
2       2024-09-30 23:32:24.672
3       2024-09-30 23:42:11.207
4       2024-09-30 23:49:25.380
                  ...          
99995   2024-10-31 23:44:23.211
99996   2024-10-31 23:44:45.948
99997   2024-10-31 23:50:31.160
99998   2024-10-31 23:53:02.355
99999   2024-10-31 23:54:02.851
Name: started_at, Length: 100000, dtype: datetime64[ns]

Minumum date: 
2024-09-30 23:12:01.622000

Maximum date: 
2024-10-31 23:54:02.851000

Number of missing values: 
ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    10377
start_station_id      10377
end_station_name      10515
end_station_id        10515
start_lat                 0
start_lng                 0
end_lat                  87
end_lng                  87
member_casual             0
dtype: int64
                    count  percentage
ride_id                 0       0.000
rideable_ty

#### Task 2d: Create Trip Duration Column and Set Index

Before setting the index, create a derived column for trip duration:

1. Calculate trip_duration_min by subtracting `started_at` from `ended_at`, then converting to minutes using `.dt.total_seconds() / 60`
3. Display basic statistics (min, max, mean) for the new column using `.describe()`
4. Show the first few rows with `started_at`, `ended_at`, and `trip_duration_min` to verify the calculation
5. Set `started_at` as the DataFrame's index. Verify the change by printing the index and displaying the first few rows.

##### Your Code

In [162]:
# task 2d code here...
rides_DataFrame = pd.DataFrame(rides, columns=['ride_id',
                                               'rideable_type',
                                               'started_at',
                                               'ended_at',
                                               'start_station_name',
                                               'start_station_id',
                                               'end_station_name',
                                               'end_station_id',
                                               'start_lat',
                                               'start_lng',
                                               'end_lat',
                                               'end_lng',
                                               'member_casual'])

trip_duration_sec = rides_DataFrame['ended_at'] - rides_DataFrame['started_at']
trip_duration_min = trip_duration_sec.dt.total_seconds() / 60
rides_DataFrame['trip_duration'] = trip_duration_min 

print(f"Trip duration in minutes: \n{trip_duration_min}\n") 
#print(rides) 

rides_statistics = rides_DataFrame['trip_duration'].describe()
print(f"Basic Statistics: \n{rides_statistics}\n")

print(rides_DataFrame)
#print(f"\nData Types: \n{df_raw.dtypes}")

#df_raw.info
rides_DataFrame_date = rides_DataFrame.set_index('started_at')
print(rides_DataFrame_date) 

Trip duration in minutes: 
0        67.984200
1        82.742067
2        50.899583
3        28.093733
4        17.034933
           ...    
99995     5.130183
99996     6.029167
99997     5.222133
99998     5.422000
99999     3.690467
Length: 100000, dtype: float64

Basic Statistics: 
count    100000.000000
mean         16.144576
std          52.922539
min           0.006533
25%           5.489271
50%           9.423592
75%          16.407171
max        1499.949717
Name: trip_duration, dtype: float64

                ride_id  rideable_type              started_at  \
0      67BB74BD7667BAB7  electric_bike 2024-09-30 23:12:01.622   
1      5AF1AC3BA86ED58C  electric_bike 2024-09-30 23:19:25.409   
2      7961DD2FC1280CDC   classic_bike 2024-09-30 23:32:24.672   
3      2E16892DEEF4CC19   classic_bike 2024-09-30 23:42:11.207   
4      AAF0220F819BEE01  electric_bike 2024-09-30 23:49:25.380   
...                 ...            ...                     ...   
99995  6D5AFF497514A788   clas

#### Interpretation

Reflect on problem 2 and answer (briefly / concisely) the following questions:

1. What types did Pandas assign to `started_at` and `member_casual` in Task 2a? Why might these defaults be problematic?
2. Look at the values in the station ID fields. Based on what you learned about git commit IDs in HW3a, how do you think the station IDs were derived?
3. Explain in your own words what method chaining is, what `df.isna().sum()` does and how it works.
4. Assume you found ~10% missing values in station columns but ~0% in coordinates. What might explain this? How might you handle the affected rows?

##### Your Answers

#### Problem 2 Interpretation 
1. Pandas assigned the *data type*, `object` to both `started_at` and `member_casual`.
    - These default *data types* could be problematic because the two items are actually different types of information/data.   
        - `started_at` actually includes dates and times, versus `member_casua` includes two options, which describe/label the type of rider as *member* or *casual*  
1. The directions in HW3a explained that commits are given a unique ID; therefore, I would think that the station ID fields are randomly generated. HW3a further describes them as a fingerprint.  
1. Method chaining is using `df.isna().sum()` where:  
   - The `isna` method finds all the non-number, NaN, values and labels them as `True` or `False`. Since Python uses True and False as labels for the binary values `1` and `0`, the `sum` function can be used to count the number of `True` values, which are the NaN values.
1. Depending on how the name of the location was found, there might not have been a specific location name to correlate with the coordinates. 
    - Another possible explanation is that there was a technical error in the software, in which pandas skipped over the data when reading the CSV file. 

#### Follow-Up (Graduate Students Only)

Compare memory usage results in 2a.3 and 2b.3. What caused the change? Why are these numbers different from what is reported at the bottom of `df.info()`? Which should you use if data size is a concern?

Working with DataFrames typically requires 5-10x the dataset size in available RAM. On a system with 16GB, assuming about 30% overhead from the operating system and other programs, what range of dataset sizes would be safely manageable? Calculate using both 5x (optimistic) and 10x (conservative) multipliers, then explain which you'd recommend for reliable work.

##### Your Answers

Total memory usage with raw data: 63.70 MB  
Total memory usage with proper data types: 12.85 MB
- The format of the raw data occupied much more memory than a format with the proper data types.
    - I believe this could be due to the raw data being displayed in a format where column items were paired with appropriate/correct data types according to their type of value, which in this case is a float, category, object, and datetime.
    - If data size is a concern, the format for proper data types should be used
- 


### Problem 3: Filtering and Transformation

#### Task 3a: Boolean Indexing and Membership Testing

Use boolean indexing and the `isin()` method to answer these questions:

1. How many trips were taken by *members* using *electric bikes*? Use `&` to combine conditions.
2. What percentage of all trips does this represent?
3. How many trips started at any of these three stations: "Streeter Dr & Grand Ave", "DuSable Lake Shore Dr & Monroe St", or "Kingsbury St & Kinzie St"? Use `isin()`.

Note: Remember to use parentheses around each condition when combining with `&`.

##### Your Code

In [187]:
# Task 3a code here...
e_bikes = rides[(rides['member_casual'] == 'member') & (rides['rideable_type'] == 'electric_bike')]
num_e_bikes = len(e_bikes) 

bikes = rides['rideable_type']
# Could technically use 'bikes' but using rides also returns the amount of rows in each column 
num_bikes = len(rides) 

num_trips_values = rides[rides['start_station_name'].isin(['DuSable Lake Shore Dr & Monroe St'])] 
num_trips = len(num_trips_values)

#print(e_bikes)
print(f"\nTrips taken by members using electric bikes: {num_e_bikes}\n")
print(bikes)
print(f"\nTrips taken by members using bikes: {num_bikes}\n")

print(f"Percent of trips used with electric bikes: {(num_e_bikes/num_bikes)*100}%\n")

print(num_trips_values)
print(f"\nNumber of trips started at 'DuSable Lake Shore Dr & Monroe St': {num_trips}\n")  



Trips taken by members using electric bikes: 33121

started_at
2024-09-30 23:12:01.622    electric_bike
2024-09-30 23:19:25.409    electric_bike
2024-09-30 23:32:24.672     classic_bike
2024-09-30 23:42:11.207     classic_bike
2024-09-30 23:49:25.380    electric_bike
                               ...      
2024-10-31 23:44:23.211     classic_bike
2024-10-31 23:44:45.948     classic_bike
2024-10-31 23:50:31.160     classic_bike
2024-10-31 23:53:02.355     classic_bike
2024-10-31 23:54:02.851     classic_bike
Name: rideable_type, Length: 100000, dtype: category
Categories (2, object): ['classic_bike', 'electric_bike']

Trips taken by members using bikes: 100000

Percent of trips used with electric bikes: 33.121%

                                  ride_id  rideable_type  \
started_at                                                 
2024-10-01 11:00:12.236  ADC5550BDA0B810A   classic_bike   
2024-10-01 11:41:12.895  1DCB2A400A2A4506   classic_bike   
2024-10-01 11:46:13.633  DAC4064243C3

#### Task 3b: Create Derived Columns from Datetime+

Add two categorical columns to the rides DataFrame based on trip start time:

1. `is_weekend`: Boolean column that is True for Saturday/Sunday trips. Use .dt.dayofweek on the index (Monday=0, Sunday=6).
2. `time_of_day`: String categories based on start hour:
   - "Morning Rush" if hour is 7, 8, or 9
   - "Evening Rush" if hour is 16, 17, or 18
   - "Midday" for all other hours

For step 2, initialize the column to "Midday", then use .loc[mask, 'time_of_day'] with boolean masks to assign rush hour categories. Extract hour using .dt.hour on the index.

After creating both columns, use value_counts() on time_of_day to show the distribution.


#### Brainstorming 
- Creating *categories* in the column `time_of_day` and assigning them the appropiate start hours
    - Use `.loc[mask, 'time_of_day']` **but first**:
      - Need to use `.dt.hour` to extract the hour values from the column `'started_at'` 

##### Your Code

In [175]:
# Task 3b code here...

#rides['day_of_week'] = rides['started_at'].dt.dayofweek
day_of_week = rides_DataFrame['started_at'].dt.dayofweek
print(day_of_week) 

rides_DataFrame['is_weekend'] = day_of_week.isin([5,6]) # the numbers for days are being listed under the 'is_weekend' column  

rides_DataFrame['time_of_day'] = "Midday" # Initializing each value in this column to "Midday"
hour_values = rides_DataFrame['started_at'].dt.hour # Extracting hour values from the 'started_at' column 
#print(hour_values) 

specific_hours_MR = hour_values.isin([7,8,9]) # Finding hour values: [7,8,9]
rides_DataFrame.loc[specific_hours_MR, 'time_of_day'] = "Morning Rush" 

specific_hours_ER = hour_values.isin([16,17,18]) # Finding hour values: [16,17,18]
rides_DataFrame.loc[specific_hours_ER, 'time_of_day'] = "Evening Rush" 

print(rides_DataFrame['time_of_day'].value_counts())

#del rides['is_weekend']
#del rides['day_of_week']

print(rides_DataFrame)

0        0
1        0
2        0
3        0
4        0
        ..
99995    3
99996    3
99997    3
99998    3
99999    3
Name: started_at, Length: 100000, dtype: int32
time_of_day
Midday          55912
Evening Rush    28218
Morning Rush    15870
Name: count, dtype: int64
                ride_id  rideable_type              started_at  \
0      67BB74BD7667BAB7  electric_bike 2024-09-30 23:12:01.622   
1      5AF1AC3BA86ED58C  electric_bike 2024-09-30 23:19:25.409   
2      7961DD2FC1280CDC   classic_bike 2024-09-30 23:32:24.672   
3      2E16892DEEF4CC19   classic_bike 2024-09-30 23:42:11.207   
4      AAF0220F819BEE01  electric_bike 2024-09-30 23:49:25.380   
...                 ...            ...                     ...   
99995  6D5AFF497514A788   classic_bike 2024-10-31 23:44:23.211   
99996  527E9D2BDCAFFEC4   classic_bike 2024-10-31 23:44:45.948   
99997  33A63439F82E7542   classic_bike 2024-10-31 23:50:31.160   
99998  2BE6AF69988C197F   classic_bike 2024-10-31 23:53:02.355   
99

#### Task 3c: Complex Filtering with query()

Use the `query()` method to find trips that meet **all** of these criteria:
- Casual riders (not members)
- Weekend trips  
- Duration greater than 20 minutes
- Electric bikes

Report:
1. How many trips match these criteria?
2. What percentage of all trips do they represent?
3. What is the average duration of these trips?

Hint: Column names work directly in `query()` strings. Combine conditions with `and`.

##### Your Code

In [176]:
# Task 3c code here...

#is_weekend = rides_DataFrame['is_weekend'].isin([5,6])
rides_tripCriteria = rides_DataFrame[rides_DataFrame['is_weekend'].isin([5,6])]
print(rides_tripCriteria)
trip_criteria = rides_DataFrame.query('member_casual == "casual" and is_weekend == True and trip_duration > 20 and rideable_type == "electric_bike"')
trip_criteria_columns = trip_criteria[['member_casual', 'is_weekend', 'trip_duration', 'rideable_type']]
num_trip_criteria = len(trip_criteria_columns)
print(f"Trips with the four criteria: \n{trip_criteria_columns}")
#print(rides_DataFrame['is_weekend'].unique())
#num_trip_criteria = len(trip_criteria)
#print(trip_criteria)
print(f"\n{num_trip_criteria} trips match these trip criteria")  
#print(num_trip_criteria)

percent_trip_criteria = (num_trip_criteria/len(rides_DataFrame)) * 100
#print(percent_trip_criteria)

print(f"\nPercent of trips these criteria represent: {percent_trip_criteria}%\n")  

#Length = len(trip_criteria_values['trip_duration'])
#Sum = trip_criteria_vxalues['trip_duration'].sum()
print(f"Average duration of these trips: {trip_criteria['trip_duration'].mean():.2f}") 


Empty DataFrame
Columns: [ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, member_casual, trip_duration, is_weekend, time_of_day]
Index: []
Trips with the four criteria: 
      member_casual  is_weekend  trip_duration  rideable_type
14332        casual        True      56.608250  electric_bike
14349        casual        True      30.063433  electric_bike
14361        casual        True     140.151717  electric_bike
14369        casual        True     131.828583  electric_bike
14456        casual        True      23.264000  electric_bike
...             ...         ...            ...            ...
88031        casual        True      25.337467  electric_bike
88097        casual        True      22.346083  electric_bike
88197        casual        True      81.179050  electric_bike
88212        casual        True      23.110933  electric_bike
88273        casual        True      2

#### Task 3d: Explicit Indexing Practice

Practice using `loc[]` and `iloc[]` for different selection tasks:

1. Use `iloc[]` to select the first 10 trips, showing only `member_casual`, `rideable_type`, and `trip_duration_min` columns
2. Use `loc[]` to select trips from October 15-17 (use date strings `'2024-10-15':'2024-10-17'`), showing the same three columns
3. Count how many trips occurred during this date range

Note: When using `iloc[]`, remember it's position-based (0-indexed). When using `loc[]` with the datetime index, you can slice using date strings.

#### Brainstorming  
1. Uing `iloc[]` to find the 1st 10 trips of the columns `member_casual`, `rideable_type`, and `trip_duration_min` 
    - Finding rows in certain columns ex.: `[row 1, columns 0 and 1]`
    - `frame3.iloc[1, 0:2]`: **Remember**, 2 is exclusive!
2. I had trouble finding the dates in these columns since the dates are rows
   - I asked Cluade, and it reminded me that since the dates are rows, the dates need to be set as the index 

##### Your Code

In [163]:
# Task 3d code here...
#rides_DataFrame_date = rides_DataFrame.set_index('started_at')
three_Columns = rides_DataFrame_date[['member_casual', 'rideable_type', 'trip_duration']]
#print(three_Columns) 
firstTen_trips = three_Columns.iloc[:10]  
print(f"First ten trips for the columns 'member_casual', 'rideable_type', 'trip_duration': \n\n{firstTen_trips}\n")  

October_trips = three_Columns.loc['2024-10-15':'2024-10-17']  
print(f"Trips made in the date range '2024-10-15' to '2024-10-17': \n\n{October_trips}\n")  
#print(October_trips)  

Num_October_trips = len(October_trips)
print(f"Number of trips made during the date range '2024-10-15' to '2024-10-17': {Num_October_trips}\n")  

First ten trips for the columns 'member_casual', 'rideable_type', 'trip_duration': 

                        member_casual  rideable_type  trip_duration
started_at                                                         
2024-09-30 23:12:01.622        casual  electric_bike      67.984200
2024-09-30 23:19:25.409        casual  electric_bike      82.742067
2024-09-30 23:32:24.672        member   classic_bike      50.899583
2024-09-30 23:42:11.207        casual   classic_bike      28.093733
2024-09-30 23:49:25.380        member  electric_bike      17.034933
2024-09-30 23:49:40.016        member  electric_bike      13.009367
2024-10-01 00:00:53.414        member   classic_bike       2.598817
2024-10-01 00:05:44.954        member  electric_bike       0.013433
2024-10-01 00:06:12.035        member  electric_bike      10.472933
2024-10-01 00:10:03.646        member   classic_bike       7.825683

Trips made in the date range '2024-10-15' to '2024-10-17': 

                        member_casual

#### Interpretation

Reflect on this problem and answer (briefly / concisely) the following questions:

1. `isin()` advantages: Compare using `isin(['A', 'B', 'C'])` versus `(col == 'A') | (col == 'B') | (col == 'C')`. Beyond readability, what practical advantage does `isin()` provide when filtering for many values (e.g., 20+ stations)?
2. Conditional assignment order: In Task 3b, why did we initialize all values to "Midday" before assigning rush hour categories? What would go wrong if you assigned categories in a different order, or didn't set a default?
3. `query()` vs boolean indexing: The `query()` method in Task 3c could have been written with boolean indexing instead. When would you choose `query()` over boolean indexing? When might boolean indexing be preferable despite being more verbose?

##### Your Answers

#### Problem 3 interpretation

1. The syntax of `isin(['A', 'B', 'C'])` is simpler and does not require comparison and logical operators, including `==` and `|` in this case, which can reduce the error in code syntax.
2. We initialized all the values to "Midday" for one specific reason and another reason that I believe could be true.
   - The specific reason is that the `Morning Rush` and `Evening Rush` hours were certain values, and `Midday` hours were the rest of the values  
   - Additionally, it ensures that none of the values would be labeled as a non-number, `NaN`, by default.
   - If `Morning Rush` and `Evening Rush` categories were assigned first, the rest of the values, I believe, would be labeled as a non-number, `NaN`, by default
3. I would choose to use `query()` over indexing when I need to find specific categories or ranges in several different columns.
    - The one detail that can be forgettable, however, is including the `@` symbol when using a string, because this tells Python to look for a variable. In this case, it could be preferable to use Boolean indexing. 

#### Follow-Up (Graduate Students Only)

Pandas supports a variety of indexing paradigms, including bracket notation (`df['col']`), label-based indexing (`loc[]`), and position-based indexing (`iloc[]`). The lecture recommended using bracket notation only for columns, and loc/iloc for everything else. Explain the rationale: why is this approach better than using bracket notation for everything, even though `df[0:5]` technically works for row slicing?

##### Your Answers

#### Graduate follow-up interpretation 
- `loc[]` provides a consistent way to access labeled rows and/or columns. With this method/approach, indexing can be accessed with the name of the columns and rows, versus having to use the column and row numbers.
- `iloc[]` provides a consistent way to access data by the numbered index 

### Problem 4: Temporal Analysis and Export

Time-based patterns are crucial for understanding bike share usage. In this problem, you'll analyze when trips occur, how usage differs between rider types, and export filtered results. You'll use the datetime index you set in Problem 2 and the derived columns from Problems 2-3.

#### Task 4a: Identify Temporal Patterns

Use the datetime index to extract temporal components and identify usage patterns:

1. Extract the *hour* from the index and use `value_counts()` to find the most popular hour for trips. Report the peak hour and how many trips occurred during that hour.
2. Extract the *day name* from the index and use `value_counts()` to find the busiest day of the week. Report the day and number of trips.
3. Sort the results from step 2 to show days in order from Monday to Sunday (not by trip count). Use `sort_index()`.

Hint: Use `.dt.hour` and `.dt.day_name()` on the datetime index.

##### Your Code

In [188]:
# Task 4a code here...

hours = rides_DataFrame_date.index.hour # Extracting hour values from the index
hour_count = hours.value_counts()
print(f"Most popular hours for trips: \n{hour_count}\n")

peak_hour = hour_count.idxmax()
print(f"Peak hour: {peak_hour}\n")

peak_count = hour_count.max()
print(f"Number of trips during the peak hour: {peak_count}\n")

days = rides_DataFrame_date.index.day_name() 
#print(days)
days_count = days.value_counts()

days_ordered = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
days_count_ordered = days_count.reindex(days_ordered)
print(f"Number of trips during each of the week: \n{days_count_ordered}\n")

busiest_day = days_count.idxmax()
print(f"Busiest day of the week: {busiest_day}\n")

numTrips_busiestDay = days_count.max()
print(f"Number of trips on the busiest day: {numTrips_busiestDay}\n")


Most popular hours for trips: 
started_at
17    10574
16     9705
18     7939
15     7541
14     6392
8      6391
13     6131
12     5818
11     5070
19     4885
7      4776
9      4703
10     4498
20     3461
21     2725
6      2281
22     2107
23     1507
0      1121
5       749
1       719
2       429
3       252
4       226
Name: count, dtype: int64

Peak hour: 17

Number of trips during the peak hour: 10574

Number of trips during each of the week: 
started_at
Monday       11531
Tuesday      14970
Wednesday    16513
Thursday     16080
Friday       13691
Saturday     14427
Sunday       12788
Name: count, dtype: int64

Busiest day of the week: Wednesday

Number of trips on the busiest day: 16513



#### Task 4b: Compare Groups with groupby()

Use `groupby()` (introduced in 07a) to compare trip characteristics across different groups:

1. Calculate the average trip duration by rider type (`member_casual`). Which group takes longer trips on average?
2. Calculate the average trip duration by bike type (`rideable_type`). Which bike type has the longest average trip?
3. Count the number of trips by rider type using `groupby()` with `.size()`. Compare this with using `value_counts()` on the `member_casual` column - do they give the same result?

Note: Use single-key groupby only (one column at a time).

3. Result:
    - `groupby()` with `.size()` and `value_counts('member_casual')` return a similar result.
    - The only *difference* is the order of the categorys 

##### Your Code

In [190]:
# Task 4b code here...

rider_type = rides_DataFrame_date.groupby('member_casual')['trip_duration'].mean()  
print(rider_type)
print(f"\n{rider_type.idxmax().capitalize()} groups take longer trips on average.\n")

bike_type = rides_DataFrame.groupby('rideable_type')['trip_duration'].mean()  
print(bike_type)
print(f"\n{bike_type.idxmax().capitalize()}s take longer trips on average.\n")

rider_type = rides_DataFrame.groupby('member_casual')['trip_duration'].size()  
#print(rider_type)
print(f"Number of trips by each rider type: \n{rider_type} \n")

rider_type = rides_DataFrame.value_counts('member_casual')  
#print(rider_type)
print(f"Number of trips by each rider type: \n{rider_type} \n")

member_casual
casual    23.978046
member    11.984493
Name: trip_duration, dtype: float64

Casual groups take longer trips on average.

rideable_type
classic_bike     20.337410
electric_bike    12.033618
Name: trip_duration, dtype: float64

Classic_bikes take longer trips on average.

Number of trips by each rider type: 
member_casual
casual    34686
member    65314
Name: trip_duration, dtype: int64 

Number of trips by each rider type: 
member_casual
member    65314
casual    34686
Name: count, dtype: int64 



  rider_type = rides_DataFrame_date.groupby('member_casual')['trip_duration'].mean()
  bike_type = rides_DataFrame.groupby('rideable_type')['trip_duration'].mean()
  rider_type = rides_DataFrame.groupby('member_casual')['trip_duration'].size()


#### Task 4c: Filter, Sample, and Export

Create a filtered dataset for weekend electric bike trips and export it:

The provided code once again uses Path to create an `output` directory and constructs the full file path as `output/weekend_electric_trips.csv`. Use the `output_file` variable when calling `.to_csv()`.

1. Filter for trips where `is_weekend == True` and `rideable_type == 'electric_bike'`
2. Use `iloc[]` to select the first 1000 trips from this filtered dataset
3. Use `reset_index()` to convert the datetime index back to a column (so it's included in the export)
4. Export to CSV with filename `weekend_electric_trips.csv`, including only these columns: `started_at`, `ended_at`, `member_casual`, `trip_duration_min`, `time_of_day`
5. Use `index=False` to avoid writing the default numeric index to the file

After exporting, report how many total weekend electric bike trips existed before sampling to 1000.

##### Your Code

In [159]:
# do not modify this setup code
from pathlib import Path

output_dir = Path('output')
output_dir.mkdir(exist_ok=True)
output_file = output_dir / 'weekend_electric_trips.csv'

# Task 4c code here...
# use the variable `output_file` as the filename for step 4

weekend_electric_trips = rides_DataFrame_date.query('is_weekend == True and rideable_type == "electric_bike"')
#print(weekend_electric_trips)
output_file = weekend_electric_trips.iloc[:1000].reset_index()

# Saving specific columns
output_file.to_csv('weekend_electric_trips.csv', 
          columns=['started_at', 'ended_at', 'member_casual', 'trip_duration', 'time_of_day'],
          index=False)

# Verifying what was written to the csv file
with open('weekend_electric_trips.csv', 'r') as f:
    content = f.read()
    print(content[:1000], '...')  # First 1000 characters





started_at,ended_at,member_casual,trip_duration,time_of_day
2024-10-01 00:05:44.954,2024-10-01 00:05:45.760,member,0.013433333333333334,Midday
2024-10-01 00:06:12.035,2024-10-01 00:16:40.411,member,10.472933333333334,Midday
2024-10-01 00:10:29.526,2024-10-01 00:17:04.344,casual,6.580299999999999,Midday
2024-10-01 00:13:10.580,2024-10-01 00:13:26.944,casual,0.2727333333333333,Midday
2024-10-01 00:13:18.415,2024-10-01 00:16:57.230,member,3.646916666666667,Midday
2024-10-01 00:14:36.529,2024-10-01 00:23:45.959,casual,9.157166666666665,Midday
2024-10-01 00:14:48.058,2024-10-01 00:15:21.393,member,0.5555833333333333,Midday
2024-10-01 00:17:03.628,2024-10-01 00:17:18.589,member,0.24935000000000002,Midday
2024-10-01 00:23:29.903,2024-10-01 00:24:08.807,casual,0.6484000000000001,Midday
2024-10-01 00:24:43.011,2024-10-01 01:21:26.197,casual,56.71976666666667,Midday
2024-10-01 00:24:51.949,2024-10-01 00:25:51.820,casual,0.99785,Midday
2024-10-01 00:27:03.582,2024-10-01 00:33:10.798,member,6.1202

#### Interpretation

Reflect on this problem and answer the following questions:

1. `groupby() conceptual model`: Explain in your own words what `groupby()` does. Use the phrase "split-apply-combine" in your explanation and describe what happens at each stage.
2. `value_counts()` vs `groupby()`: In Task 4b.3, you compared two approaches for counting trips by rider type. When would you use `value_counts()` versus `groupby().size()`? Is there a situation where only one of them would work?
3. Index management for export: In Task 4c, why did we use `reset_index()` before exporting? What would happen if you exported with the datetime index still in place and used `index=False`?

##### Your Answers

#### Problem 4 interpretation here 
1. `groupby()` allows you to see a certain type of numeric data for each category in a column  
    - The type of numeric data in this case is the duration of trips  
    - This model uses the concept of "split-apply-combine": this explanation is below
       - It splits/takes numeric values from another column and displays them with another column 
       - `bike_type = rides_DataFrame.groupby('rideable_type')['trip_duration'].mean() `
       - In the code above, `groupby()` *splits* the columns `rideable_type` and `trip_duration` from `rides_DataFrame` and then combines them to be side-by-side columns 
          - **Additionally**, it applies and displays the average/mean duration for each category of `rideable_type`
1. From the result of using these two methods, it appears that `groupby().size()` returns the categories in alphabetical order.
    - Therefore, I guess I would use `value_counts()` when I am not worried about the order of the categories
    - I cannot think of a situation where only one of them would work. 
   
1. We reset the index so that `start_at` could be accessed as a column again, instead of the index.
   - If exported with the datetime index in place in addition to using `index=False`, an error will be returned 
       

#### Follow-Up (Graduate Students Only)

Compare `CSV` and _pickle_ formats for data storage and retrieval.

Pickle is Python's built-in serialization format that saves Python objects exactly as they exist in memory, preserving all data types, structures, and metadata. Unlike CSV (which converts everything to text), pickle is binary (not human readable) and maintains the complete state of your DataFrame. Also, pickle files only work in Python, while CSV is universal. Read more in the [Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html).

The code below investigates an interesting pattern: Do riders take longer trips from scenic lakefront stations even during rush hours? This could indicate tourists or recreational riders using these popular locations for leisure trips during typical commute times. The analysis filters for trips over 15 minutes that started from lakefront stations during morning (7-9am) or evening (4-6pm) rush hours, sorted by duration to see the longest trips first.

Run the code below, then answer the interpretation questions:

In [166]:
import os

# the following lines were commented out since they were run in 4c
# from pathlib import Path
# output_dir = Path('output')

rides = rides_DataFrame_date

csv_file = output_dir / 'lakefront_rush_trips.csv'
pickle_file = output_dir / 'lakefront_rush_trips.pkl'

# Filter for interesting pattern: Long trips (>15 min) during rush hours 
# from lakefront stations, sorted by duration
lakefront_rush = (rides
    .loc[(rides.index.hour.isin([7, 8, 9, 16, 17, 18]))]
    .loc[(rides['start_station_name'].str.contains('Lake Shore|Lakefront', 
                                                    case=False, 
                                                    na=False))]
    .loc[rides['trip_duration'] > 15]
    .sort_values('trip_duration', ascending=False)
    .head(1000)
    .reset_index()
    [['started_at', 'ended_at', 'start_station_name', 'end_station_name',
      'member_casual', 'rideable_type', 'trip_duration']]
)

print(f"Found {len(lakefront_rush)} long rush-hour trips from lakefront stations")

# Export to both formats
lakefront_rush.to_csv(csv_file, index=False)
lakefront_rush.to_pickle(pickle_file)

# Compare file sizes
csv_size = os.path.getsize(csv_file) / 1024  # Convert to KB
pickle_size = os.path.getsize(pickle_file) / 1024
print(f"\nCSV file size: {csv_size:.2f} KB")
print(f"Pickle file size: {pickle_size:.2f} KB")
print(f"Size difference: {abs(csv_size - pickle_size):.2f} KB")

# Compare load times
print("\nLoad time comparison:")
print("CSV:")
%timeit pd.read_csv(csv_file)
print("\nPickle:")
%timeit pd.read_pickle(pickle_file)

# Check data type preservation
# Note: CSV load without parse_dates loses datetime types
csv_loaded = pd.read_csv(csv_file)
pickle_loaded = pd.read_pickle(pickle_file)

print("\nData types from CSV (without parse_dates):")
print(csv_loaded.dtypes)
print("\nData types from Pickle:")
print(pickle_loaded.dtypes)

Found 310 long rush-hour trips from lakefront stations

CSV file size: 40.57 KB
Pickle file size: 55.16 KB
Size difference: 14.59 KB

Load time comparison:
CSV:
2.75 ms ± 142 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pickle:
1.18 ms ± 80.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Data types from CSV (without parse_dates):
started_at             object
ended_at               object
start_station_name     object
end_station_name       object
member_casual          object
rideable_type          object
trip_duration         float64
dtype: object

Data types from Pickle:
started_at            datetime64[ns]
ended_at              datetime64[ns]
start_station_name          category
end_station_name            category
member_casual               category
rideable_type               category
trip_duration                float64
dtype: object


After running the code, answer these questions:

1. Method chaining: The analysis uses method chaining with a specific formatting pattern:

   ```python
   result = (df
       .method1()
       .method2()
       .method3()
   )
   ```

   This wraps the entire chain in parentheses, allowing each method to appear on its own line without backslashes. Discuss why this makes formatting more readable, how it makes debugging easier, how it relates to seeing changes in the code with git diff, and what downsides heavy chaining might have.
3. Data types: Compare the dtypes from CSV versus pickle. What types were preserved by pickle that were lost in CSV? Why is this preservation significant for subsequent analysis?
4. Trade-offs: Given your observations about size, speed, and type preservation, when would you choose pickle over CSV for your work? When would CSV still be the better choice despite pickle's advantages?


#### Graduate follow-up interpretation here
1. From a coding standpoint, a single file of code could be hundreds of lines long. Therefore, formatting code to be more readable, like method chaining, can help the developer/creator of the code be able to understand and read their code more efficiently because it is in a neat format.
    - Since each method is on a *separate line*, debugging code can be less stressful
    - The downsides of heavy chaining could lead to the overuse of method chaining, such as an unnecessary amount of coding lines
1. The dtypes `datetime64[ns]` for the columns `started_at` and `ended_at` were preserved by pickle.
   - This preservation is significant because it labels this variable by the dtype that it is known for in the data set, which is a variable representing a date. A subsequent analysis might call for the date to be labeled by its specific data type.
1. I would use pickle if I had needed access to specific dtypes in a high-pressure environment where speed and efficiency were necessary, and of course, I had computer memory that could handle the file size for pickle.
    - However, CSV would be the better choice if I did not have the memory availability to handle a large file size.
          - In this case, the runtime would be a little slower CSV would still help me see whether a value is a number, `int`, or `float`, or an object, a string. 

## Reflection

Address the following questions in a markdown cell:

1. NumPy vs Pandas
   - What was the biggest conceptual shift moving from NumPy arrays to Pandas DataFrames?
   - Which Pandas concept was most challenging: indexing (loc/iloc), missing data, datetime operations, or method chaining? How did you work through it?
2. Real Data Experience
   - How did working with real CSV data (with missing values, datetime strings, etc.) differ from hw2b's synthetic NumPy arrays?
   - Based on this assignment, what makes Pandas well-suited for data analysis compared to pure NumPy?
3. Learning & Application
   - Which new skill from this assignment will be most useful for your own data work?
   - On a scale of 1-10, how prepared do you feel to use Pandas for your own projects? What would increase that score?
4. Feedback
   - Time spent: ___ hours (breakdown optional)
   - Most helpful part of the assignment: ___
   - One specific improvement suggestion: ___

### Your Answers

1. - The biggest conceptual was prabably understanding how to use the index feature of Pandas DataFrames
   - probably learning how to use loc/iloc for indexing properly, and understanding why I was getting True and False to show up using `isin()`
   - I used Claude to troubleshoot and help me understand output errors, and I also met with the TA to clarify some concepts and troubleshoot code
2. - Working with real CSV data allowed me to see the format of real data. It was interesting to see how missing values were displayed as NaN and how many missing values there were.
   - Pandas includes many methods and functions that allow manipulation and sorting of data a more efficient process versus pure NumPy
3. Many new skills would be most useful, such as using `loc()` / `iloc()` and `groupby()`.
     - These are powerful tools when needing to find ranges of data and/or separate and combine data
   - I feel pretty prepared, being that I have never used a language this powerful
     - Definitely just having time to practice all the methods and techniques used in Pandas
4. - probably around 15 hours, since I am very new to these powerful languages
   - The steps, including what functions and methods to use initially
   - Definitely the workload of the assignment. It can be a lot to write a good interpretation in addition to understanding and writing code 


This URL is to my claude chat where I asked for clarifications:

https://claude.ai/share/ea251f8d-4739-46c2-ac57-1441ea5c6ded
