# Session 3 - Data Transformations, DateTime Operations & EDA Concepts

**Q1** - DateTime Feature Engineering

You have a DataFrame with a timestamp column (string format: "2024-03-15 14:30:00"). You need to create three new features: hour_of_day, day_name, and is_weekend. What's the correct sequence of operations?

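```python
import pandas as pd

# Parse the string timestamps first; the .dt accessor only works on datetime columns
df['date'] = pd.to_datetime(df['date'])

df['day_name'] = df['date'].dt.day_name()  # day_name is a method, so it must be called
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6])  # Saturday=5, Sunday=6
df['hour_of_day'] = df['date'].dt.hour
```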

**Q2** - Vectorized vs .apply() Decision

You need to create a new column that categorizes trip durations:

* "short" if < 30 minutes
* "medium" if 30-60 minutes
* "long" if > 60 minutes

Should you use vectorized operations or .apply() for this task? Briefly explain why.

Use .apply() for this task because it is a complex task to split up the filter and return a string, essentially creating a categorical column based on the filter results.

**INCORRECT** - The logic is simple enough for a vectorized approach using np.select() or chained conditions, which avoids the row-by-row overhead of .apply().

_preferred approach_

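```python
import numpy as np

conditions = [
    df['duration'] < 30,
    (df['duration'] >= 30) & (df['duration'] <= 60),
    df['duration'] > 60
]

choices = ['short', 'medium', 'long']

# A string default keeps the dtype consistent; rows matching no condition (e.g. NaN) get 'unknown'
df['category'] = np.select(conditions, choices, default='unknown')
```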

**Q3**  - Anscombe's Quartet - The Core Lesson 

Four datasets have nearly identical summary statistics (mean, std dev, correlation). What is the fundamental lesson Anscombe's Quartet teaches us about data analysis, and why does this matter for real-world work?

The core lesson is that even though the datasets are statistically similar, their visualizations differ, which tells us that their real-world meanings are not at all similar.

**Fundamental Lesson**:
Summary statistics alone (mean, std dev, correlation) can be misleading or insufficient. Datasets with identical statistics can have completely different structures, patterns, and meanings. You MUST visualize your data to understand what's actually going on.

**ANSWERS**

Q1 - 
```python
df['date'] = pd.to_datetime(df['date']) 

df['day_name'] = df['date'].dt.day_name()
df['is_weekend'] = df['date'].dt.dayofweek.isin([5,6])
df['hour_of_day'] = df['date'].dt.hour
```

Q2 - 
Use .apply() for this task because it is a complex task to split up the filter and return a string, essentially creating a categorical column based on the filter results. (Marked incorrect above: a vectorized approach such as np.select() is preferred.)

Q3 - 
The core lesson is that even though the datasets are statistically similar, their visualizations differ, which tells us that their real-world meanings are not at all similar.


**Q4** - Data Type Identification

Classify each variable and explain what analysis is appropriate:

1) Customer satisfaction rating (1-5 stars)
2) Temperature in Celsius
3) Customer ID number
4) Payment method (credit, debit, cash, crypto)

1 - ordered categorical (it appears as a number, but there are only 5 options and their order matters; counts, median, and mode are appropriate)
2 - continuous (float64; the value can be any number, including decimals; mean and standard deviation are appropriate)
3 - object (a customer ID could technically be stored as an integer, but it is only a label, so we wouldn't want to accidentally aggregate these numbers; counting and grouping are appropriate)
4 - unordered categorical (a fixed set of string values with no order; counts and mode are appropriate)
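
As a rough sketch of how these four variable types might be represented in pandas (the column names and sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative, not from a real dataset
df = pd.DataFrame({
    "rating": [5, 3, 4, 1, 3],
    "temp_c": [21.5, 18.0, 25.3, 19.9, 22.1],
    "customer_id": [1001, 1002, 1003, 1004, 1005],
    "payment": ["credit", "cash", "debit", "crypto", "credit"],
})

# 1) Ordinal: an ordered categorical keeps the order without implying arithmetic
rating_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
df["rating"] = df["rating"].astype(rating_type)

# 3) Identifier: store as a string so it cannot be summed or averaged by accident
df["customer_id"] = df["customer_id"].astype(str)

# 4) Nominal: an unordered categorical
df["payment"] = df["payment"].astype("category")

print(df.dtypes)
print(df["rating"].value_counts().sort_index())  # counts per star level
print(df["temp_c"].mean())                       # a mean is meaningful for continuous data
print(df["payment"].mode()[0])                   # the mode summarizes nominal data
```

Storing the ID as a string is one possible safeguard; the point is only that the dtype should reflect how the variable may be analyzed.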

**Q5** - Validation vs Verification

You're analyzing taxi trip data. You calculate that the average trip duration is 15.3 minutes and your code runs without errors. Have you performed validation, verification, both, or neither? Explain.

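Verification - the code works (it runs and does what it was written to do)
Validation - the results are good (they make sense for the real-world problem)

For this example, the code is verified but not validated: running without errors says nothing about whether 15.3 minutes is a plausible average.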
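
A minimal sketch of what a validation step could look like, assuming a hypothetical duration_min column and an assumed plausibility range of 0 to 180 minutes (neither comes from the original exercise):

```python
import pandas as pd

# Hypothetical trip durations in minutes; values are made up for illustration
trips = pd.DataFrame({"duration_min": [12.0, 15.5, -3.0, 14.2, 900.0, 16.1]})

# Verification-style result: the computation runs and returns a number
print(trips["duration_min"].mean())

# Validation-style check: do the values make sense for real taxi trips?
# The 0-180 minute range is an assumed threshold, not a standard
implausible = trips[(trips["duration_min"] <= 0) | (trips["duration_min"] > 180)]
print(implausible)
```

Here the mean computes without error even though two of the six rows are clearly implausible, which is the gap between verification and validation.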

**Q6** - Feature Engineering Performance

You have a 1 million row DataFrame. You need to create a feature that extracts the hour from a datetime column. Estimate the performance difference between using .dt.hour vs .apply(lambda x: x.hour). Which would you choose and why?

I would use .dt.hour over .apply(lambda x: x.hour) because the .dt accessor is vectorized and optimized for this operation, whereas .apply() has to call the function row by row, which makes it much slower on 1 million rows.
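
One way to measure the difference yourself is a small timing sketch like the following. The data is synthetic, and the row count is reduced to 100,000 to keep the demo quick; actual timings depend on the machine and the pandas version.

```python
import time
import pandas as pd

# Synthetic datetime column; raise periods to 1_000_000 to match the question
s = pd.Series(pd.date_range("2024-01-01", periods=100_000, freq="min"))

start = time.perf_counter()
fast = s.dt.hour                    # vectorized accessor
t_vectorized = time.perf_counter() - start

start = time.perf_counter()
slow = s.apply(lambda x: x.hour)    # Python-level call per row
t_apply = time.perf_counter() - start

# Both approaches give the same values; only the runtime differs
same = (fast.to_numpy() == slow.to_numpy()).all()
print(f"same result: {same}")
print(f".dt.hour: {t_vectorized:.4f}s  .apply: {t_apply:.4f}s")
```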

**ANSWERS**

Q4 - 

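1 - ordered categorical (it appears as a number, but there are only 5 options and their order matters; counts, median, and mode are appropriate)
2 - continuous (float64; the value can be any number, including decimals; mean and standard deviation are appropriate)
3 - object (a customer ID could technically be stored as an integer, but it is only a label, so we wouldn't want to accidentally aggregate these numbers; counting and grouping are appropriate)
4 - unordered categorical (a fixed set of string values with no order; counts and mode are appropriate)

Q5 - 
Verification - the code works (it runs and does what it was written to do)
Validation - the results are good (they make sense for the real-world problem)

For this example, the code is verified but not validated: running without errors says nothing about whether 15.3 minutes is a plausible average.

Q6 - 
I would use .dt.hour over .apply(lambda x: x.hour) because the .dt accessor is vectorized and optimized for this operation, whereas .apply() has to call the function row by row, which makes it much slower on 1 million rows.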

**Q7** - Descriptive Statistics - Why Multiple Measures?

You calculate the mean trip duration is 15 minutes. Your colleague says "Great, now we understand trip durations!" Why is relying on the mean alone potentially misleading? Name at least two other statistics you'd want to check.

Relying on the mean alone hides outliers and skew. We'd also want to check the median, min, max, standard deviation, and interquartile range (IQR).
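
A small illustration with made-up durations, chosen so that the mean is 15 minutes even though most trips are shorter:

```python
import pandas as pd

# Made-up durations (minutes): nine typical trips and one very long one
durations = pd.Series([8, 9, 10, 10, 11, 12, 12, 13, 14, 51])

print(durations.mean())     # 15.0, pulled up by the single long trip
print(durations.median())   # 11.5, closer to a typical trip
print(durations.std())
iqr = durations.quantile(0.75) - durations.quantile(0.25)
print(iqr)
print(durations.describe())  # count, mean, std, min, quartiles, max in one call
```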

**Q8** - DataFrame Terminology

In a machine learning context with a DataFrame predicting taxi fares:

What do we call the columns trip_distance, passenger_count, hour_of_day?
What do we call the column fare_amount (what we're predicting)?
What do we call each row?

The columns trip_distance, passenger_count, and hour_of_day are features, fare_amount is the target, and the rows are observations.
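
In code, this terminology usually shows up as a split into a feature matrix and a target vector. A minimal sketch with made-up rows (the names X and y are a common convention, not something the exercise specifies):

```python
import pandas as pd

# Made-up rows; each row is one observation (one taxi trip)
df = pd.DataFrame({
    "trip_distance": [1.2, 3.4, 0.8],
    "passenger_count": [1, 2, 1],
    "hour_of_day": [8, 17, 23],
    "fare_amount": [7.5, 15.0, 6.0],
})

X = df[["trip_distance", "passenger_count", "hour_of_day"]]  # features
y = df["fare_amount"]                                        # target
print(X.shape, y.shape)
```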

**Q9** -  EDA Philosophy - Classical vs Exploratory

Classical statistics emphasizes hypothesis testing and p-values before looking at data. Exploratory Data Analysis (EDA) emphasizes visualization and pattern discovery first. In the taxi trip dataset, which approach would you use to decide if you should create separate models for weekday vs weekend trips? Why?

I would choose the EDA approach, because I can easily split the data and compare the weekday and weekend trips visually to determine whether it would be valuable to model them separately.
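
A first exploratory pass might look like the sketch below. The four trips and the duration_min column are made up; in practice this would run on the full taxi DataFrame.

```python
import pandas as pd

# Made-up sample of trips
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2024-03-15 08:10:00",  # Friday
        "2024-03-15 18:45:00",  # Friday
        "2024-03-16 11:30:00",  # Saturday
        "2024-03-17 02:15:00",  # Sunday
    ]),
    "duration_min": [14.0, 22.0, 9.0, 11.0],
})
df["is_weekend"] = df["pickup_datetime"].dt.dayofweek.isin([5, 6])

# Side-by-side summary of the two groups
summary = df.groupby("is_weekend")["duration_min"].describe()
print(summary)
# A natural next step is a plot per group,
# e.g. df.boxplot(column="duration_min", by="is_weekend")
```

If the two groups look clearly different in both the summary and the plots, that is a reason to consider separate models or at least an is_weekend feature.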

**ANSWERS**

Q7 - 
Relying on the mean alone hides outliers and skew. We'd also want to check the median, min, max, standard deviation, and interquartile range (IQR).

Q8 - 
The columns trip_distance, passenger_count, and hour_of_day are features, fare_amount is the target, and the rows are observations.

Q9 - 
I would choose the EDA approach, because I can easily split the data and compare the weekday and weekend trips visually to determine whether it would be valuable to model them separately.

**Q10** -  Real-World Scenario - DateTime Feature Engineering

You're analyzing ride-share data and notice the model performs poorly for early morning trips (12am-5am). You have a pickup_datetime column. What temporal features would you engineer to help the model distinguish these trips, and why might they be predictive?

I would engineer a categorical field, for example df['time_of_day'], which bins the pickup hour into ['early_morning','morning','afternoon','early_evening','night']. This may help the model perform by condensing these trips into a single category so that the sample size per category is larger.
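
A minimal sketch of that binning with pd.cut. The bin edges are an assumption (only the 12am-5am window comes from the question), and is_early_morning is a hypothetical extra flag:

```python
import pandas as pd

# Made-up pickups; pickup_datetime is the column named in the question
df = pd.DataFrame({"pickup_datetime": pd.to_datetime([
    "2024-03-15 02:30:00", "2024-03-15 09:00:00", "2024-03-15 14:30:00",
    "2024-03-15 19:15:00", "2024-03-15 23:45:00",
])})

df["hour_of_day"] = df["pickup_datetime"].dt.hour

# Assumed bin edges; [0, 5) matches the 12am-5am window from the question
df["time_of_day"] = pd.cut(
    df["hour_of_day"],
    bins=[0, 5, 12, 17, 21, 24],
    right=False,  # left-closed intervals, so hour 5 falls in 'morning'
    labels=["early_morning", "morning", "afternoon", "early_evening", "night"],
)
df["is_early_morning"] = df["hour_of_day"] < 5  # explicit flag for the weak segment
print(df)
```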

**Q11** - Anscombe's Quartet Application

Your colleague reports: "I analyzed our customer satisfaction data. Mean = 3.2 stars, correlation between price and satisfaction = -0.15. Conclusion: price barely affects satisfaction." Based on Anscombe's Quartet lesson, what would you ask them before accepting this conclusion?

Based on the lesson of Anscombe's Quartet, which emphasizes visualization over summary statistics alone, I would ask what visualizations they put together and what they were able to interpret from them.
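
To see why the summary numbers alone are not enough, the quartet itself can be checked in a few lines. This sketch only computes the statistics; the variable names are my own, and plotting each set as a scatter plot is the step that reveals the differences.

```python
import pandas as pd

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

stats = {}
for name, (x, y) in quartet.items():
    d = pd.DataFrame({"x": x, "y": y})
    stats[name] = (d["y"].mean(), d["y"].std(), d["x"].corr(d["y"]))
    print(name, [round(v, 2) for v in stats[name]])
# The four printed rows are nearly identical, yet the scatter plots differ:
# a linear trend, a curve, a line with one outlier, and a vertical cluster with one far point.
```

For the colleague's claim, the equivalent check would be a scatter plot of price against satisfaction before trusting the -0.15 correlation.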

**Q12** - Data Type and Analysis Choice

You're building a model to predict taxi fares. You have a pickup_borough feature (Manhattan, Brooklyn, Queens, Bronx, Staten Island). Your colleague suggests encoding it as numbers 1-5 and using it directly in linear regression. What's wrong with this approach? What data type is pickup_borough and how should it be handled?

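pickup_borough is an unordered (nominal) categorical feature. Encoding it as the numbers 1-5 implies an order and equal spacing between boroughs that do not exist, and linear regression would treat that as meaningful. It should be one-hot encoded instead; as a summary statistic, the mode is the appropriate measure.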
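
A common way to feed a nominal feature like this into linear regression is one-hot encoding. A minimal sketch using pd.get_dummies, with made-up fare values:

```python
import pandas as pd

# Made-up fares; borough names come from the question
df = pd.DataFrame({
    "pickup_borough": ["Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"],
    "fare_amount": [18.5, 14.0, 16.2, 12.8, 20.1],
})

# One indicator column per borough; drop_first avoids perfectly collinear columns
encoded = pd.get_dummies(df, columns=["pickup_borough"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("pickup_borough_")]
print(dummy_cols)
```

With drop_first=True the first category alphabetically (Bronx here) becomes the baseline, and each remaining column measures the difference from it.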

**ANSWERS**

Q10 - I would engineer a categorical field, for example df['time_of_day'], which bins the pickup hour into ['early_morning','morning','afternoon','early_evening','night']. This may help the model perform by condensing these trips into a single category so that the sample size per category is larger.

Q11 - Based on the lesson of Anscombe's Quartet, which emphasizes visualization over summary statistics alone, I would ask what visualizations they put together and what they were able to interpret from them.

Q12 - 
pickup_borough is an unordered (nominal) categorical feature. Encoding it as the numbers 1-5 implies an order and equal spacing between boroughs that do not exist, and linear regression would treat that as meaningful. It should be one-hot encoded instead; as a summary statistic, the mode is the appropriate measure.
