# Practice Lab: Buenos Aires Subway - Descriptive statistics

You now return to your analysis of the Buenos Aires Subway system to take another look at the busiest stations. This time, you want to capture descriptive statistics so you can better quantify the resources needed to manage the traffic. You'll do just that in this practice lab.

<div style="text-align: center;">
<img src="imgsL3/subway_map_wcircle.jpg" alt="Subway map" width="600"/>
</div>

## General instructions
- **Replace any instances of `None` with your own code**. All `None`s must be replaced.
- **Compare your results with the expected output** shown below the code.
- **Check the solution** using the expandable cell to verify your answer.

Happy coding!

<div style="background-color: #FAD888; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
<strong>Important note</strong>: Code blocks with None will not run properly. If you run them before completing the exercise, you will likely get an error. 
</div>

## Table of Contents

- [Step 1: Load and filter the data](#step-1)

- [Step 2: Descriptive statistics](#step-2)

- [Step 3: Plot the station counts](#step-3)

- [Step 4: Statistics per station](#step-4)
 
- [Step 5: Statistics per hour and subway line](#step-5)

## Step 1: Load and filter the data
<a id="step-1"></a>
Begin by loading the data. It is the same dataset you used in the previous lab and has these features:

- `date`: date of the observation, in format YYYY-MM-DD
- `hour`: hour of the observation
- `station`: name of the subway station
- `line`: name of the subway line (A, B, C, D, E, H). Each line corresponds to one of the colored lines in the map above.
- `pax_TOTAL`: total number of passengers at the station during that hour

```python
import pandas as pd

# Load the data
df = pd.read_csv("march2024_pax_hourly.csv")

# Preview the result
df.head()
```

```txt
         date  hour           station  line  pax_TOTAL
0  2024-03-01     5            Acoyte     A         50
1  2024-03-01     5            Aguero     D          1
2  2024-03-01     5           Alberti     A          6
3  2024-03-01     5    Angel Gallardo     B         30
4  2024-03-01     5  Avenida La Plata     E         26
```


You want to calculate statistics for extreme conditions, so you decide to filter the dataset and keep only the rows where `pax_TOTAL` is above the 95th percentile.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    
**▶▶▶ Directions**
1. Use the `quantile()` method to get the 0.95 quantile of the `pax_TOTAL` column. Store it in the <code>pax_95q</code> variable.

</div>

```python
### START CODE HERE ###

# Get the 95th percentile of the total passengers
pax_95q = None

### END CODE HERE ###

# Filter the dataset
df_95q = df[df["pax_TOTAL"] > pax_95q]

# Sort the dataset
df_95q = df_95q.sort_values(by="pax_TOTAL", ascending=False)

# Print the number of rows
print("rows in this dataset:", len(df_95q))
```

```txt
rows in this dataset: 2461
```


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>

```txt
rows in this dataset: 2461
```

</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Get the 95th percentile of the total passengers
pax_95q = df["pax_TOTAL"].quantile(0.95)
```
</ul>
</details>
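
To make the quantile filter more concrete, here is a minimal sketch on made-up numbers (the `toy` DataFrame and the 0.9 cutoff are illustrative assumptions, not part of the lab dataset). It shows that `quantile()` returns a single threshold value, and that a strict `>` comparison keeps only the rows above it.

```python
import pandas as pd

# Toy passenger counts (made-up values, not from the lab dataset)
toy = pd.DataFrame({"pax_TOTAL": [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]})

# 0.9 quantile: the value below which roughly 90% of the observations fall.
# By default pandas interpolates linearly between neighboring values.
threshold = toy["pax_TOTAL"].quantile(0.9)

# Keep only the rows strictly above the threshold
busiest = toy[toy["pax_TOTAL"] > threshold]

print("threshold:", threshold)
print("rows kept:", len(busiest))
```

In this toy case the threshold lands between 90 and 1000, so only the single extreme row survives the filter. The same pattern with `0.95` keeps roughly the top 5% of rows in a larger dataset.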

<a id="step-2"></a>

## Step 2: Descriptive statistics

Next, you calculate summary statistics to see how many passengers the stations handle under these busy conditions.

<a id="do-it-yourself"></a>

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">

**▶▶▶ Directions** 
1. Find the mean of total passengers in the <code>df_95q</code> DataFrame.
2. Find the median of total passengers in the <code>df_95q</code> DataFrame.
3. Find the maximum number of passengers in the <code>df_95q</code> DataFrame.
4. Find the standard deviation of passengers in the <code>df_95q</code> DataFrame.
</div>



```python
### START CODE HERE ###

# Find the mean, median, maximum, and standard deviation of pax_TOTAL
pax_mean = None
pax_median = None
pax_max = None
pax_std = None

### END CODE HERE ###

print("The mean pax is:", pax_mean)
print("The median pax is:", pax_median)
print("The max pax is:", pax_max)
print("The standard deviation is:", pax_std)
```

```txt
The mean pax is: 2034.19179195449
The median pax is: 1580.0
The max pax is: 13651
The standard deviation is: 1429.2539690049282
```


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>

```txt
The mean pax is: 2034.19179195449
The median pax is: 1580.0
The max pax is: 13651
The standard deviation is: 1429.2539690049282
```

</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Find the mean, median, maximum, and standard deviation of pax_TOTAL
pax_mean = df_95q['pax_TOTAL'].mean()
pax_median = df_95q['pax_TOTAL'].median()
pax_max = df_95q['pax_TOTAL'].max()
pax_std = df_95q['pax_TOTAL'].std()
```
</ul>
</details>
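
In the expected output, the mean (about 2034) is well above the median (1580), which usually points to a right-skewed distribution: a few very large values pull the mean upward. The sketch below reproduces that effect on a made-up series (the `toy` values are assumptions for illustration only). It also uses `describe()`, which was not part of the exercise, to get several statistics in one call.

```python
import pandas as pd

# Made-up, right-skewed counts: one extreme value among moderate ones
toy = pd.Series([1200, 1300, 1400, 1500, 9600], name="pax_TOTAL")

toy_mean = toy.mean()      # pulled upward by the extreme value
toy_median = toy.median()  # middle value, much less affected by the outlier
toy_max = toy.max()
toy_std = toy.std()        # sample standard deviation (pandas default ddof=1)

print("mean:", toy_mean)
print("median:", toy_median)
print("max:", toy_max)
print("std:", toy_std)

# describe() reports count, mean, std, min, quartiles, and max together
print(toy.describe())
```

Here the mean is 3000 while the median is 1400, so the single value of 9600 more than doubles the mean without moving the median at all.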

<a id="step-3"></a>

## Step 3: Plot the station counts

You want to know the top stations that encounter this kind of passenger traffic.

<a id="do-it-yourself"></a>

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">

**▶▶▶ Directions** 
1. Use the <code>value_counts()</code> method on the 'station' column of the <code>df_95q</code> DataFrame.
2. Use the <code>head()</code> method to get the top 10 results.
3. Use the <code>plot()</code> method to display the results as a bar chart.
</div>



```python
### START CODE HERE ###

# Get the value counts per station
station_counts = None

# Get the top 10 results
station_counts_10 = None

# Plot the top 10 results
None

### END CODE HERE ###
```

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 

<br>
<img src="imgsL3/output_step3.png" width="450">
</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Get the value counts per station
station_counts = df_95q["station"].value_counts()

# Get the top 10 results
station_counts_10 = station_counts.head(10)

# Plot the top 10 results
station_counts_10.plot(kind="bar")
```
</ul>
</details>
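
Note that `value_counts()` counts rows, not passengers: each row of `df_95q` is one station at one date and hour, so the bar chart shows how often each station exceeded the threshold. A minimal sketch on made-up labels (the `toy` series is an assumption for illustration) shows the counting and sorting behavior:

```python
import pandas as pd

# Made-up station labels, one entry per busy observation
toy = pd.Series(
    ["Retiro", "Catedral", "Retiro", "Constitucion", "Retiro", "Catedral"],
    name="station",
)

# value_counts() counts occurrences and sorts from most to least frequent
counts = toy.value_counts()

# head(2) keeps the two most frequent labels
top_2 = counts.head(2)
print(top_2)
```

Calling `top_2.plot(kind="bar")` would draw the same kind of chart as in the exercise. It is left out here so the sketch runs without a plotting backend.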

<a id="step-4"></a>

## Step 4: Statistics per station

Next, you want to compute some statistics segmented per station. You will calculate the mean of `pax_TOTAL` for each station and return the results as a Series.

<a id="do-it-yourself"></a>

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">

**▶▶▶ Directions** 
1. Use the <code>groupby()</code> method on the <code>df_95q</code> DataFrame to segment the data by 'station'. Store the results in <code>grouped_by_station</code>.
2. Use the <code>mean()</code> method on the 'pax_TOTAL' column of the <code>grouped_by_station</code> object. Store the results in <code>mean_per_station</code>.
</div>

```python
### START CODE HERE ###

# Segment the data by station
grouped_by_station = None

# Calculate the mean of the 'pax_TOTAL' column per station
mean_per_station = None

### END CODE HERE ###

# Sort the results and only print the top 10
mean_per_station.sort_values(ascending=False).head(10)
```

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 
<br>

```txt
station
Constitucion        3603.701456
Catedral            2134.927536
Plaza de Mayo       2067.885714
Retiro              1991.120968
Rosas               1982.820000
San Pedrito         1926.078704
Saenz Pena          1910.250000
Hospitales          1892.000000
Leandro N. Alem     1888.698413
Federico Lacroze    1830.571429
Name: pax_TOTAL, dtype: float64
```

</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Segment the data by station
grouped_by_station = df_95q.groupby("station")

# Calculate the mean of the 'pax_TOTAL' column per station
mean_per_station = grouped_by_station["pax_TOTAL"].mean()
```
</ul>
</details>
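
A per-station mean says nothing about how many rows it is based on, and a mean computed from very few busy hours may be less representative. As a hedged sketch on made-up data (the `toy` DataFrame is an assumption, and `agg()` goes beyond what the exercise asks for), you could compute the mean and the row count side by side:

```python
import pandas as pd

# Made-up busy observations for two stations
toy = pd.DataFrame({
    "station": ["Retiro", "Retiro", "Catedral", "Catedral", "Catedral"],
    "pax_TOTAL": [2000, 3000, 1500, 1600, 1700],
})

# Group the rows by station, then aggregate a single column
grouped = toy.groupby("station")
mean_by_station = grouped["pax_TOTAL"].mean()

# agg() computes several statistics at once; the count shows how many
# rows each mean is based on
summary = grouped["pax_TOTAL"].agg(["mean", "count"])
print(summary.sort_values("mean", ascending=False))
```

Here Retiro has the higher mean (2500 from 2 rows), while Catedral's mean (1600) rests on 3 rows.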

<a id="step-5"></a>

## Step 5: Statistics per hour and subway line

Lastly, you want to calculate statistics by both hour and line. This should allow you to see at which times these conditions occurred on each line.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li> Use the <code>pivot_table()</code> method on the <code>df_95q</code> DataFrame with these arguments: 
            <ul>
                <li>Use the <code>hour</code> feature as index (rows).</li>
                <li>Use the <code>line</code> feature as columns.</li>
                <li>Use the <code>pax_TOTAL</code> feature as the values to fill the table.</li>
                <li>Use the <code>mean</code> function to aggregate the values.</li>
            </ul>
        </ol>
</div>

```python
### START CODE HERE ###

# Create the pivot table
pivot_hour_line = None

### END CODE HERE ###

# Print the results
pivot_hour_line
```

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 

<br>
<img src="imgsL3/output_step5.png" width="500">
</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Create the pivot table
pivot_hour_line = df_95q.pivot_table(index="hour", 
                                     columns="line",
                                     values="pax_TOTAL",
                                     aggfunc="mean")

# Print the results
pivot_hour_line
```
</ul>
</details>
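
When an hour and line combination has no rows in the filtered data, `pivot_table()` leaves that cell as `NaN`. In this lab, that would mean no station on that line exceeded the threshold at that hour. The sketch below shows the behavior on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Made-up busy observations: line B has no rows at hour 18
toy = pd.DataFrame({
    "hour": [8, 8, 8, 18, 18],
    "line": ["A", "A", "B", "A", "A"],
    "pax_TOTAL": [1000, 2000, 1500, 3000, 5000],
})

# Rows are hours, columns are lines, and each cell holds the mean pax_TOTAL
pivot = toy.pivot_table(index="hour", columns="line",
                        values="pax_TOTAL", aggfunc="mean")
print(pivot)
```

The cell for hour 18 and line B is `NaN` because no such row exists, while hour 8 on line A averages the two values 1000 and 2000 into 1500.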

The pivot table shows that some lines have busy stations almost throughout the day. Others show a more distinctive pattern: line E, for example, is usually below the threshold but saw high passenger counts well into the night (hours 21 and 22).

**Congratulations on completing this series of practice labs!**