
Here is another set of pandas exercises that simulate real-world data analysis scenarios:

**Problem 1:  Analyzing Survey Data**

*File: `survey_responses.csv`*

```
RespondentID,Age,Gender,Occupation,Satisfaction
1,35,Male,Engineer,4
2,28,Female,Teacher,5
3,42,Male,Manager,3
4,22,Female,Student,4
5,55,Male,Retired,5
... (more rows)
```

**Tasks:**

1. Read the CSV into a DataFrame.
2. Calculate the average satisfaction rating.
3. Determine the distribution of occupations (count per occupation).
4. Find the average age and satisfaction rating for each gender.
5. Create a new column called `AgeGroup` categorizing respondents into 'Young' (under 30), 'Middle-Aged' (30-50), and 'Senior' (over 50).

**Solution:**

In [None]:
import pandas as pd

# Task 1: load the survey responses
df_survey_responses = pd.read_csv('sample_data/survey_responses.csv')

# Task 2: overall mean satisfaction rating
average_satisfaction = df_survey_responses['Satisfaction'].mean()
print("Average Satisfaction Rating:", average_satisfaction)

# Task 3: number of respondents per occupation
occupation_counts = df_survey_responses['Occupation'].value_counts()
print("\nOccupation Distribution:\n")
print(occupation_counts)

# Task 4: mean age and mean satisfaction rating per gender
grp_gender = df_survey_responses.groupby('Gender')
average_age_gender = grp_gender['Age'].mean()
average_satisfaction_gender = grp_gender['Satisfaction'].mean()
merged_age_satisfaction = pd.concat([average_age_gender, average_satisfaction_gender], axis=1)
print("\nAverage Age and Satisfaction Rating by Gender:\n")
print(merged_age_satisfaction)

Average Satisfaction Rating: 3.8

Occupation Distribution:

Occupation
Engineer       2
Teacher        2
Manager        2
Student        2
Retired        2
Doctor         2
Lawyer         2
Nurse          2
Salesperson    2
Accountant     2
Name: count, dtype: int64

Average Age and Satisfaction Rating by Gender:

              Age  Satisfaction
Gender                         
Female  33.375000           3.5
Male    39.333333           4.0
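
The solution cell above covers tasks 1 through 4 only. Task 5 could be sketched as follows, assuming ages are whole numbers as in the sample rows. The small DataFrame is built from the five rows shown in the file excerpt so the block runs on its own, and the bin edges are one possible reading of "under 30", "30-50", and "over 50".

```python
import numpy as np
import pandas as pd

# Five rows copied from the survey_responses.csv excerpt
df = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4, 5],
    "Age": [35, 28, 42, 22, 55],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Occupation": ["Engineer", "Teacher", "Manager", "Student", "Retired"],
    "Satisfaction": [4, 5, 3, 4, 5],
})

# Left-closed bins (right=False), assuming integer ages:
#   [0, 30)   -> Young
#   [30, 51)  -> Middle-Aged (30 through 50)
#   [51, inf) -> Senior
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 30, 51, np.inf],
    right=False,
    labels=["Young", "Middle-Aged", "Senior"],
)

print(df[["RespondentID", "Age", "AgeGroup"]])
```

`pd.cut` returns a categorical column, so a later `groupby('AgeGroup')` keeps the groups in the order Young, Middle-Aged, Senior rather than alphabetical order.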


**Problem 2:  Handling Time Zones and Date Ranges**

*File: `flight_data.csv`*

```
FlightID,DepartureCity,ArrivalCity,DepartureTime,ArrivalTime
F123,New York,London,2023-06-15 08:00:00,2023-06-15 15:00:00 # EDT
F456,Los Angeles,Tokyo,2023-06-16 10:30:00,2023-06-17 18:00:00 # PDT
...(more rows with various time zones)
```

**Tasks:**

1. Read the CSV, parsing `DepartureTime` and `ArrivalTime` as datetime.
2. Convert times to UTC.
3. Calculate flight durations in hours.
4. Filter for flights departing between June 15th and June 20th (UTC).

**Solution:**

In [None]:
import pandas as pd

# Read CSV with datetime parsing
df_flights = pd.read_csv('sample_data/flight_data.csv', parse_dates=['DepartureTime', 'ArrivalTime'])

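# Label the timestamps as UTC. With utc=True, naive values are treated as
# already being in UTC; no offset from a local zone (EDT, PDT, ...) is applied.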
df_flights['DepartureTime'] = pd.to_datetime(df_flights['DepartureTime'], utc=True)
df_flights['ArrivalTime'] = pd.to_datetime(df_flights['ArrivalTime'], utc=True)

# Calculate flight durations in hours.
df_flights['DurationHours'] = (df_flights['ArrivalTime'] - df_flights['DepartureTime']).dt.total_seconds() / 3600

# Filter for flights departing between June 15th and June 20th (UTC).
# Note: end_date is 2023-06-20 00:00 UTC, so departures later on June 20th are excluded.
start_date = pd.to_datetime('2023-06-15', utc=True)
end_date = pd.to_datetime('2023-06-20', utc=True)
df_filtered_flights = df_flights[ (df_flights['DepartureTime'] >= start_date) & (df_flights['DepartureTime'] <= end_date) ]

print("\nFlights departing between June 15th and 20th (UTC):")
df_filtered_flights


Flights departing between June 15th and 20th (UTC):


|   | FlightID | DepartureCity | ArrivalCity | DepartureTime | ArrivalTime | DurationHours |
|:--|:---------|:--------------|:------------|:--------------|:------------|:--------------|
| 0 | F123 | New York | London | 2023-06-15 08:00:00+00:00 | 2023-06-15 15:00:00+00:00 | 7.0 |
| 1 | F456 | Los Angeles | Tokyo | 2023-06-16 10:30:00+00:00 | 2023-06-17 18:00:00+00:00 | 31.5 |
| 2 | F789 | Chicago | Paris | 2023-06-17 12:00:00+00:00 | 2023-06-18 21:00:00+00:00 | 33.0 |
| 3 | F101 | San Francisco | Berlin | 2023-06-18 14:30:00+00:00 | 2023-06-19 23:30:00+00:00 | 33.0 |
| 4 | F112 | Seattle | Rome | 2023-06-19 16:00:00+00:00 | 2023-06-20 01:00:00+00:00 | 9.0 |
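
The `+00:00` offsets in this output show that the wall-clock values were labeled as UTC unchanged. If the times are really local (the file excerpt annotates its rows with EDT and PDT), a conversion along the following lines may be closer to what task 2 asks for. This is only a sketch: the `CITY_TZ` lookup and the `to_utc` helper are hypothetical, and it assumes both times in a row are expressed in the departure city's zone, which the excerpt does not confirm.

```python
import pandas as pd

# Two rows copied from the flight_data.csv excerpt, as naive wall-clock times
df = pd.DataFrame({
    "FlightID": ["F123", "F456"],
    "DepartureCity": ["New York", "Los Angeles"],
    "ArrivalCity": ["London", "Tokyo"],
    "DepartureTime": pd.to_datetime(["2023-06-15 08:00:00", "2023-06-16 10:30:00"]),
    "ArrivalTime": pd.to_datetime(["2023-06-15 15:00:00", "2023-06-17 18:00:00"]),
})

# Hypothetical lookup: departure city -> IANA time zone (not part of the data)
CITY_TZ = {
    "New York": "America/New_York",        # EDT in June
    "Los Angeles": "America/Los_Angeles",  # PDT in June
}

def to_utc(times: pd.Series, cities: pd.Series) -> pd.Series:
    """Localize each naive timestamp to its city's zone, then convert to UTC."""
    converted = [
        ts.tz_localize(CITY_TZ[city]).tz_convert("UTC")
        for ts, city in zip(times, cities)
    ]
    return pd.to_datetime(pd.Series(converted, index=times.index), utc=True)

# Assumption: both columns use the departure city's zone
df["DepartureUTC"] = to_utc(df["DepartureTime"], df["DepartureCity"])
df["ArrivalUTC"] = to_utc(df["ArrivalTime"], df["DepartureCity"])
df["DurationHours"] = (df["ArrivalUTC"] - df["DepartureUTC"]).dt.total_seconds() / 3600

print(df[["FlightID", "DepartureUTC", "ArrivalUTC", "DurationHours"]])
```

Under this assumption the durations are unchanged (both ends shift by the same offset), but the UTC departure times move, which can change which flights pass the date filter.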


In [None]:
import pandas as pd

# Read CSV with datetime parsing
df_flights = pd.read_csv('sample_data/flight_data.csv', parse_dates=['DepartureTime', 'ArrivalTime'])

# Treat the naive timestamps as UTC (no local-zone offset is applied)
df_flights['DepartureTime'] = pd.to_datetime(df_flights['DepartureTime'], utc=True)
df_flights['ArrivalTime'] = pd.to_datetime(df_flights['ArrivalTime'], utc=True)

# Flight duration
df_flights['DurationHours'] = (df_flights['ArrivalTime'] - df_flights['DepartureTime']).dt.total_seconds() / 3600

# Filter by departure date range (UTC); the upper bound is 2023-06-20 00:00,
# so departures later on June 20th are not included
start_date = pd.to_datetime('2023-06-15', utc=True)
end_date = pd.to_datetime('2023-06-20', utc=True)
filtered_flights = df_flights[
    (df_flights['DepartureTime'] >= start_date) & (df_flights['DepartureTime'] <= end_date)
]

print("\nFlights departing between June 15th and 20th (UTC):")
print(filtered_flights.to_markdown(index=False, numalign="left", stralign="left"))


Flights departing between June 15th and 20th (UTC):
| FlightID   | DepartureCity   | ArrivalCity   | DepartureTime             | ArrivalTime               | DurationHours   |
|:-----------|:----------------|:--------------|:--------------------------|:--------------------------|:----------------|
| F123       | New York        | London        | 2023-06-15 08:00:00+00:00 | 2023-06-15 15:00:00+00:00 | 7               |
| F456       | Los Angeles     | Tokyo         | 2023-06-16 10:30:00+00:00 | 2023-06-17 18:00:00+00:00 | 31.5            |
| F789       | Chicago         | Paris         | 2023-06-17 12:00:00+00:00 | 2023-06-18 21:00:00+00:00 | 33              |
| F101       | San Francisco   | Berlin        | 2023-06-18 14:30:00+00:00 | 2023-06-19 23:30:00+00:00 | 33              |
| F112       | Seattle         | Rome          | 2023-06-19 16:00:00+00:00 | 2023-06-20 01:00:00+00:00 | 9               |
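
The filter in the cells above compares against `2023-06-20 00:00 UTC` with `<=`, so a flight leaving later on June 20th would be dropped. If "between June 15th and June 20th" is meant to cover all of June 20th, a half-open interval ending at the start of June 21st is one way to express it. The departure times below are hypothetical values chosen to probe the boundaries; they are not taken from `flight_data.csv`.

```python
import pandas as pd

# Hypothetical UTC departure times around the range boundaries
departures = pd.Series(
    pd.to_datetime(
        [
            "2023-06-14 23:59:00",
            "2023-06-15 00:00:00",
            "2023-06-20 00:00:00",
            "2023-06-20 17:45:00",
            "2023-06-21 00:00:00",
        ],
        utc=True,
    ),
    name="DepartureTime",
)

start = pd.Timestamp("2023-06-15", tz="UTC")
end_midnight = pd.Timestamp("2023-06-20", tz="UTC")   # bound used in the cells above
end_exclusive = pd.Timestamp("2023-06-21", tz="UTC")  # first instant after June 20th

# Original style: inclusive upper bound at the start of June 20th
midnight_bound = departures[(departures >= start) & (departures <= end_midnight)]

# Half-open interval [June 15th 00:00, June 21st 00:00): covers all of June 20th
whole_day = departures[(departures >= start) & (departures < end_exclusive)]

print(len(midnight_bound), "rows with the midnight bound")
print(len(whole_day), "rows with the half-open interval")
```

On the five flights shown in the output above, both filters select the same rows, since the latest departure is on June 19th.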

