The notebook will focus on data filtering, slicing, summarizing, and presenting.  

In [None]:
import pandas as pd

# Read the Excel file into a DataFrame
df = pd.read_excel('/content/automobile_dataset.xlsx')

# Drop null values in `Repair record 1978`
df.dropna(subset = ['Repair record 1978'], inplace=True)

# Convert columns to numeric
df['Price'] = pd.to_numeric(df['Price'])
df['Mileage (mpg)'] = pd.to_numeric(df['Mileage (mpg)'])
df['Weight (lbs.)'] = pd.to_numeric(df['Weight (lbs.)'])
df['Displacement (cu. in.)'] = pd.to_numeric(df['Displacement (cu. in.)'])

# Create new column `Weight_kg`
df['Weight_kg'] = df['Weight (lbs.)'] * 0.453592

# Create new column `Price_per_100km`: Price times fuel consumption in L/100 km (235.215 / mpg)
df['Price_per_100km'] = (df['Price'] / df['Mileage (mpg)']) * 235.215

# Create new column `Efficient`
mean_mileage = df['Mileage (mpg)'].mean()
df['Efficient'] = ['Yes' if mileage > mean_mileage else 'No' for mileage in df['Mileage (mpg)']]

# Print the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

```text
| Make and model   | Price   | Mileage (mpg)   | Repair record 1978   | Headroom (in.)   | Trunk space (cu. ft.)   | Weight (lbs.)   | Length (in.)   | Turn circle (ft.)   | Displacement (cu. in.)   | Gear ratio   | Car origin   | Weight_kg   | Price_per_100km   | Efficient   |
|:-----------------|:--------|:----------------|:---------------------|:-----------------|:------------------------|:----------------|:---------------|:--------------------|:-------------------------|:-------------|:-------------|:------------|:------------------|:------------|
| AMC Concord      | 4099    | 22              | 3                    | 2.5              | 11                      | 2930            | 186            | 40                  | 121                      | 3.58         | Domestic     | 1329.02     | 43824.8           | Yes         |
| AMC Pacer        | 4749    | 17              | 3                    | 3                | 11                      | 3350            | 173            | 40                  | 258                      | 2.53         | Domestic     | 1519.53     | 65708             | No          |
| Buick Century    | 4816    | 20              | 3                    | 4.5              | 16                      | 3250            | 196            | 40                  | 196                      | 2.93         | Domestic     | 1474.17     | 56639.8           | No          |
| Buick Electra    | 7827    | 15              | 4                    | 4                | 20                      | 4080            | 222            | 43                  | 350                      | 2.41         | Domestic     | 1850.66     | 122735            | No          |
| Buick LeSabre    | 5788    | 18              | 3                    | 4                | 21                      | 3670            | 218            | 43                  | 231                      | 2.73         | Domestic     | 1664.68     | 75634.7           | No          |

```
Here's the Python notebook for the follow-up lesson, incorporating the data transformations you requested:

## Pandas: Filtering, Slicing, Summarizing, and Presenting Data

In this lesson, we'll build on your basic Pandas knowledge and explore how to:

1. **Filter:** Select specific rows based on conditions.
2. **Slice:** Extract subsets of your data.
3. **Summarize:** Calculate statistics and aggregate information.
4. **Present:** Display your findings clearly.

We'll continue using the automobile dataset from the previous lesson.

In [None]:
import pandas as pd

# Read the Excel file into a DataFrame
df = pd.read_excel('/content/automobile_dataset.xlsx')

# Drop null values in `Repair record 1978`
df.dropna(subset = ['Repair record 1978'], inplace=True)

# Convert columns to numeric
df['Price'] = pd.to_numeric(df['Price'])
df['Mileage (mpg)'] = pd.to_numeric(df['Mileage (mpg)'])
df['Weight (lbs.)'] = pd.to_numeric(df['Weight (lbs.)'])
df['Displacement (cu. in.)'] = pd.to_numeric(df['Displacement (cu. in.)'])

# Create new column `Weight_kg`
df['Weight_kg'] = df['Weight (lbs.)'] * 0.453592

# Create new column `Price_per_100km`: Price times fuel consumption in L/100 km (235.215 / mpg)
df['Price_per_100km'] = (df['Price'] / df['Mileage (mpg)']) * 235.215

# Create new column `Efficient`
mean_mileage = df['Mileage (mpg)'].mean()
df['Efficient'] = ['Yes' if mileage > mean_mileage else 'No' for mileage in df['Mileage (mpg)']]

# Print the first 5 rows
print(df.head().to_markdown(index=False, numalign="left", stralign="left"))

### Section 1: Filtering Data

**Filtering by Condition:**

In [None]:
# Filter cars with high mileage (greater than 30 mpg)
high_mileage_cars = df[df['Mileage (mpg)'] > 30]
print("\nHigh Mileage Cars:\n", high_mileage_cars[['Make and model', 'Mileage (mpg)']].to_markdown(index=False, numalign="left", stralign="left"))

**Multiple Choice Question 1:**

Which operator is used to filter rows where the 'Price' is less than 5000 AND the 'Car origin' is 'Domestic'?

a) `|` (or)
b) `&` (and)
c) `~` (not)
d) `^` (xor)

**Answer:** b) `&` (and)
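
As a minimal sketch of how these operators combine conditions, the cell below uses a small hypothetical DataFrame (the rows are invented for illustration and are not taken from the automobile dataset). Note the parentheses around each condition: without them, Python would evaluate `&` or `|` before the comparisons.

```python
import pandas as pd

# Hypothetical sample rows; values are illustrative, not from the lesson's dataset
cars = pd.DataFrame({
    'Make and model': ['Car A', 'Car B', 'Car C', 'Car D'],
    'Price': [4200, 6100, 3900, 4800],
    'Car origin': ['Domestic', 'Domestic', 'Foreign', 'Foreign'],
})

# & keeps rows that satisfy both conditions
cheap_domestic = cars[(cars['Price'] < 5000) & (cars['Car origin'] == 'Domestic')]

# | keeps rows that satisfy at least one condition
cheap_or_domestic = cars[(cars['Price'] < 5000) | (cars['Car origin'] == 'Domestic')]

# ~ negates a condition
not_domestic = cars[~(cars['Car origin'] == 'Domestic')]

print(cheap_domestic['Make and model'].tolist())
print(cheap_or_domestic['Make and model'].tolist())
print(not_domestic['Make and model'].tolist())
```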

**Filtering with `.str.contains()`:**

In [None]:
# Filter cars made by Ford or Chevrolet
ford_or_chev = df[df['Make and model'].str.contains('Ford|Chev')]
print("\nFord or Chevrolet Cars:\n", ford_or_chev[['Make and model']].to_markdown(index=False, numalign="left", stralign="left"))
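
For comparison, here is a rough sketch of how `.isin()` and `.str.contains()` differ, using a hypothetical four-row DataFrame (the model names are made up). `.isin()` matches whole cell values exactly, while `.str.contains()` treats its argument as a regular expression and matches substrings.

```python
import pandas as pd

# Hypothetical sample rows; names are invented for illustration
cars = pd.DataFrame({
    'Make and model': ['Ford Alpha', 'Chev. Beta', 'Other Gamma', 'Fordham Delta'],
    'Car origin': ['Domestic', 'Domestic', 'Foreign', 'Foreign'],
})

# .isin() keeps rows whose value exactly equals one of the listed values
by_origin = cars[cars['Car origin'].isin(['Foreign'])]

# No cell is exactly 'Ford' or 'Chev', so this result is empty
exact = cars[cars['Make and model'].isin(['Ford', 'Chev'])]

# .str.contains() matches substrings, so 'Fordham Delta' is included too
partial = cars[cars['Make and model'].str.contains('Ford|Chev')]

print(by_origin['Make and model'].tolist())
print(len(exact))
print(partial['Make and model'].tolist())
```

Substring matching can pick up unintended rows, as the invented 'Fordham Delta' shows, so it is worth inspecting the result.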

**Challenge 1:**

Filter the DataFrame to show only cars where `Car origin` is 'Domestic' or `Efficient` is 'Yes', and where `Price_per_100km` is also less than 70,000 (values in this column run into the tens of thousands).

In [None]:
# Write solution here.

### Section 2: Slicing Data

**Selecting Rows and Columns:**

In [None]:
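# Select the first 3 rows and specific columns.
# Use .iloc for the rows: dropna() left gaps in the index labels, so .loc[:2] could return fewer than 3 rows.
subset = df.iloc[:3][['Make and model', 'Price', 'Mileage (mpg)']]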
print("\nSubset of Data:\n", subset.to_markdown(index=False, numalign="left", stralign="left"))

**Multiple Choice Question 2:**

What does the `df.iloc[5:10, 1:3]` statement do?

a) Selects rows 5 to 10 and all columns.
b) Selects rows 5 to 9 and columns 1 and 2.
c) Selects rows 0 to 4 and columns 1 and 2.
d) Selects all rows and columns 5 to 9.

**Answer:** b) Selects rows 5 to 9 and columns 1 and 2. `iloc` slices by integer position and excludes the end of each range (10 and 3).
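
The difference between position-based and label-based slicing matters once rows have been dropped. The sketch below uses a hypothetical six-row DataFrame (not the automobile data) and removes the row labelled 2 to imitate what `dropna()` can do to the index.

```python
import pandas as pd

# Hypothetical data; column names and values are illustrative only
data = pd.DataFrame({
    'model': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [10, 20, 30, 40, 50, 60],
    'mpg': [30, 28, 26, 24, 22, 20],
})

# Drop the row labelled 2, leaving index labels 0, 1, 3, 4, 5
trimmed = data.drop(index=2)

# .iloc counts positions, so this is always the first three rows
by_position = trimmed.iloc[:3]

# .loc uses labels and includes the end label; label 2 is gone, so only 0 and 1 remain
by_label = trimmed.loc[:2]

# Rows at positions 1 and 2, columns at positions 0 and 1 (end positions excluded)
block = trimmed.iloc[1:3, 0:2]

print(by_position.index.tolist())
print(by_label.index.tolist())
print(block)
```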

**Challenge 2:**

Select the rows where the `Car origin` is 'Foreign' and display only the columns `Make and model`, `Weight_kg`, and `Price_per_100km`.

In [None]:
# Write solution here.

### Section 3: Summarizing Data

**Descriptive Statistics:**

In [None]:
# Get summary statistics for numerical columns
summary_stats = df[['Price', 'Mileage (mpg)', 'Weight_kg', 'Price_per_100km']].describe()
print("\nSummary Statistics:\n", summary_stats.to_markdown(numalign="left", stralign="left"))

**Group By and Aggregate:**

In [None]:
# Calculate average price and mileage by car origin
average_by_origin = df.groupby('Car origin')[['Price', 'Mileage (mpg)']].mean()
print("\nAverage Price and Mileage by Origin:\n", average_by_origin.to_markdown(numalign="left", stralign="left"))
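
When different columns need different statistics, one option is named aggregation with `.agg()`. The following is a sketch on hypothetical values; the output column names (`n_cars`, `mean_price`, `best_mileage`) are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical sample rows; values are illustrative, not from the lesson's dataset
cars = pd.DataFrame({
    'Car origin': ['Domestic', 'Domestic', 'Foreign', 'Foreign'],
    'Price': [4000, 6000, 5000, 9000],
    'Mileage (mpg)': [20, 16, 30, 26],
})

# Each keyword names an output column and maps it to (input column, statistic)
summary = cars.groupby('Car origin').agg(
    n_cars=('Price', 'count'),
    mean_price=('Price', 'mean'),
    best_mileage=('Mileage (mpg)', 'max'),
)

print(summary)
```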

**Multiple Choice Question 3:**

Which function is used to calculate the average of a numerical column?

a) `df.sum()`
b) `df.mean()`
c) `df.median()`
d) `df.mode()`

**Answer:** b) `df.mean()`

**Challenge 3:**

Calculate the median `Weight_kg` and the minimum and maximum `Price_per_100km` for each `Car origin`.

In [None]:
# Write solution here.

### Section 4: Presenting Data

**Sorting:**

In [None]:
# Sort cars by price in descending order
sorted_by_price = df[['Make and model', 'Price']].sort_values('Price', ascending=False)
print("\nCars Sorted by Price (Descending):\n", sorted_by_price.head().to_markdown(index=False, numalign="left", stralign="left"))
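
Sorting can also use several columns at once, each with its own direction. The sketch below, again on a hypothetical DataFrame, sorts by origin (ascending) and then by price (descending), and shows `nlargest()` as a shortcut for a top-N listing.

```python
import pandas as pd

# Hypothetical sample rows; values are illustrative, not from the lesson's dataset
cars = pd.DataFrame({
    'Make and model': ['Car A', 'Car B', 'Car C', 'Car D'],
    'Car origin': ['Foreign', 'Domestic', 'Foreign', 'Domestic'],
    'Price': [5000, 4000, 9000, 6000],
})

# Sort by origin A-Z, then by price from highest to lowest within each origin
ranked = cars.sort_values(
    ['Car origin', 'Price'], ascending=[True, False]
).reset_index(drop=True)

# nlargest() returns the N rows with the highest values in a column
top_two = cars.nlargest(2, 'Price')

print(ranked)
print(top_two)
```

`reset_index(drop=True)` is optional; it renumbers the rows so the index reflects the new order.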

**Challenge 4:**

1. Sort the DataFrame by `Mileage (mpg)` in ascending order and save it as a new CSV file called "sorted_cars.csv".
2. Display the first 5 rows of the sorted DataFrame.

In [None]:
# Write solution here.
