# 📊 Visualizing Data with Pandas & Matplotlib

In this section, we explore **basic data visualizations** using `pandas` and `matplotlib.pyplot`. These tools help us uncover trends, distributions, and relationships in our data.

---

## 🔧 Setup: Using Matplotlib with Pandas

Before plotting, we need to import `matplotlib.pyplot`:

```python
import matplotlib.pyplot as plt
```

Just like we alias pandas as `pd`, we alias `matplotlib.pyplot` as `plt`.

---

## 📈 1. Line Plots: Tracking Changes Over Time

Line plots are ideal for showing **how a numeric variable changes over time**.

### Example: Sully’s Weight Over the Year

```python
sully.head()
```

| date       | weight\_kg |
| ---------- | ---------- |
| 2019-01-31 | 36.1       |
| 2019-02-28 | 35.3       |
| 2019-03-31 | 32.0       |
| 2019-04-30 | 32.9       |
| 2019-05-31 | 32.0       |

```python
sully.plot(x="date", y="weight_kg", kind="line")
plt.show()
```

### 📌 Tip: Rotate axis labels to improve readability

```python
sully.plot(x="date", y="weight_kg", kind="line", rot=45)
plt.show()
```
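
If the `date` column is stored as text, the x-axis is not a true time axis. The sketch below assumes a small DataFrame rebuilt from the table above (the name `sully_demo` is illustrative, not part of the course data) and converts the column with `pd.to_datetime` before plotting:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data copied from the table above
sully_demo = pd.DataFrame({
    "date": ["2019-01-31", "2019-02-28", "2019-03-31", "2019-04-30", "2019-05-31"],
    "weight_kg": [36.1, 35.3, 32.0, 32.9, 32.0],
})

# Convert the text dates to real datetime values
sully_demo["date"] = pd.to_datetime(sully_demo["date"])

# Draw the line plot with rotated tick labels and keep the returned Axes
ax = sully_demo.plot(x="date", y="weight_kg", kind="line", rot=45)
ax.set_ylabel("weight (kg)")
```

Call `plt.show()` afterwards to display the figure.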

---

## 📊 2. Scatter Plots: Visualizing Relationships

Use scatter plots to visualize the **relationship between two numeric variables**.

### Example: Height vs. Weight of Dogs

```python
dog_pack.plot(x="height_cm", y="weight_kg", kind="scatter")
plt.show()
```

💡 **Insight:** Taller dogs tend to weigh more — visible as a rising trend in the scatter plot.
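
A rising trend can also be checked numerically. As a rough sketch, assuming a tiny stand-in for `dog_pack` (the values below are illustrative), `Series.corr` returns the Pearson correlation between two columns:

```python
import pandas as pd

# Hypothetical stand-in for dog_pack (values are illustrative)
dog_pack_demo = pd.DataFrame({
    "height_cm": [43, 46, 59, 18, 77],
    "weight_kg": [24.0, 24.0, 29.0, 2.0, 74.0],
})

# Pearson correlation: values near +1 indicate a strong rising trend
r = dog_pack_demo["height_cm"].corr(dog_pack_demo["weight_kg"])
print(round(r, 2))
```

A value close to 0 would suggest no linear relationship, even if the scatter plot looks busy.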

---

## 📉 3. Histograms: Visualizing Distributions

Histograms show the **distribution of a numeric variable** by grouping data into bins.

### Example: Distribution of Dog Heights

```python
dog_pack["height_cm"].hist()
plt.show()
```

* X-axis: height ranges
* Y-axis: number of dogs per range

#### Customize bin size:

```python
dog_pack["height_cm"].hist(bins=20)
plt.show()
```
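
To see roughly what a histogram counts, the sketch below bins a few hypothetical heights with `value_counts(bins=...)`. This approximates what `.hist()` draws; the two differ slightly in how values that fall exactly on a bin edge are assigned.

```python
import pandas as pd

# Hypothetical heights in cm
heights = pd.Series([18, 22, 43, 46, 49, 56, 59, 77])

# Count values in 3 equal-width bins, kept in bin order rather than sorted by count
counts = heights.value_counts(bins=3, sort=False)
print(counts)
```

More bins give finer detail but noisier counts; fewer bins smooth the shape but can hide structure.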

---

## 🧱 4. Bar Plots: Comparing Groups

Bar plots compare a **numeric value across categories**, such as an average per group.

### Example: Average Weight by Breed

```python
breed_weights = dog_pack.groupby("breed")["weight_kg"].mean()
breed_weights.plot(kind="bar", title="Average Weight by Dog Breed")
plt.show()
```

🗯️ **Insight:** Saint Bernards are the heaviest breed on average.
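
Sorting the grouped result before plotting makes the ranking easier to read. A minimal sketch, assuming a small stand-in for `dog_pack` with made-up weights:

```python
import pandas as pd

# Hypothetical stand-in for dog_pack
dog_pack_demo = pd.DataFrame({
    "breed": ["Labrador", "Labrador", "Poodle", "St. Bernard", "Chihuahua"],
    "weight_kg": [29.0, 31.0, 24.0, 74.0, 2.0],
})

# Mean weight per breed, sorted so the heaviest breed comes first
breed_weights = (
    dog_pack_demo.groupby("breed")["weight_kg"]
    .mean()
    .sort_values(ascending=False)
)
print(breed_weights)
```

Calling `breed_weights.plot(kind="bar")` on the sorted Series then draws the bars in this order.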

---

## 🧙 5. Layering Plots: Comparing Subgroups

You can overlay multiple plots to compare subgroups, like male vs female dogs.

```python
dog_pack[dog_pack["sex"] == "F"]["height_cm"].hist()
dog_pack[dog_pack["sex"] == "M"]["height_cm"].hist()
plt.show()
```

🙈 Problem: We can't tell which color represents which group.

---

## 🏷️ 6. Adding a Legend

Use `plt.legend()` to label plots:

```python
dog_pack[dog_pack["sex"] == "F"]["height_cm"].hist()
dog_pack[dog_pack["sex"] == "M"]["height_cm"].hist()
plt.legend(["F", "M"])
plt.show()
```

---

## 🌫️ 7. Adding Transparency (Alpha)

Use the `alpha` argument to make histograms translucent and avoid overlap issues:

```python
dog_pack[dog_pack["sex"] == "F"]["height_cm"].hist(alpha=0.7)
dog_pack[dog_pack["sex"] == "M"]["height_cm"].hist(alpha=0.7)
plt.legend(["F", "M"])
plt.show()
```

* `alpha=0`: completely transparent (invisible)
* `alpha=1`: fully opaque

---

## 🥑 8. The Avocados Dataset (Coming Up)

You’ll work with a real dataset of weekly US avocado sales with fields like:

* `date`
* `type` (conventional or organic)
* `year`
* `avg_price`
* `size`
* `nb_sold` (number of avocados sold)

```python
print(avocados.head())
```

| date       | type         | year | avg\_price | size  | nb\_sold   |
| ---------- | ------------ | ---- | ---------- | ----- | ---------- |
| 2015-12-27 | conventional | 2015 | 0.95       | small | 9626901.09 |
| 2015-12-20 | conventional | 2015 | 0.98       | small | 8710021.76 |
| ...        | ...          | ...  | ...        | ...   | ...        |

---

## ✅ Summary: Key Plot Types

| Plot Type    | Purpose                                  | Key Function Call                     |
| ------------ | ---------------------------------------- | ------------------------------------- |
| Line Plot    | Track change over time                   | `.plot(x=..., y=..., kind="line")`    |
| Histogram    | Show distribution of numeric variable    | `.hist()`                             |
| Bar Plot     | Compare categories                       | `.plot(kind="bar")`                   |
| Scatter Plot | Visualize relationship between variables | `.plot(x=..., y=..., kind="scatter")` |
| Layered Plot | Compare subgroups visually               | Multiple `.hist()` + `plt.legend()`   |
| Transparency | Avoid overlap in layered plots           | `alpha=...` in `.hist()`              |

---

🧠 **Takeaways**:

* Use the right plot for the right insight.
* Clean, labeled plots (with rotated labels, legends, transparency) help your audience understand your data quickly.
* Pivot tables and grouped statistics can power bar plots effectively.
* You're ready to visualize real datasets like `avocados`!

---



In [None]:
# Exercise
# Which avocado size is most popular?
# Avocados are increasingly popular and delicious in guacamole and on toast. The Hass Avocado Board keeps track of avocado supply and demand across the USA, including the sales of three different sizes of avocado. In this exercise, you'll use a bar plot to figure out which size is the most popular.

# Bar plots are great for revealing relationships between categorical (size) and numeric (number sold) variables, but you'll often have to manipulate your data first in order to get the numbers you need for plotting.

# pandas has been imported as pd, and avocados is available.

# Instructions
# 100 XP
# Print the head of the avocados dataset. What columns are available?
# For each avocado size group, calculate the total number sold, storing as nb_sold_by_size.
# Create a bar plot of the number of avocados sold by size.
# Show the plot.

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Look at the first few rows of data
print(avocados.head())

# Get the total number of avocados sold of each size
nb_sold_by_size = avocados.groupby('size')['nb_sold'].sum()

# Create a bar plot of the number of avocados sold by size
nb_sold_by_size.plot(kind="bar")

# Show the plot
plt.show()

#          date          type  year  avg_price   size     nb_sold
# 0  2015-12-27  conventional  2015       0.95  small  9626901.09
# 1  2015-12-20  conventional  2015       0.98  small  8710021.76
# 2  2015-12-13  conventional  2015       0.93  small  9855053.66
# 3  2015-12-06  conventional  2015       0.89  small  9405464.36
# 4  2015-11-29  conventional  2015       0.99  small  8094803.56

In [None]:
# Changes in sales over time
# Line plots are designed to visualize the relationship between two numeric variables, where each data value is connected to the next one. They are especially useful for visualizing the change in a number over time since each time point is naturally connected to the next time point. In this exercise, you'll visualize the change in avocado sales over three years.

# pandas has been imported as pd, and avocados is available.

# Instructions
# 100 XP
# Get the total number of avocados sold on each date. The DataFrame has two rows for each date—one for organic, and one for conventional. Save this as nb_sold_by_date.
# Create a line plot of the number of avocados sold.
# Show the plot.

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Get the total number of avocados sold on each date
nb_sold_by_date = avocados.groupby('date')['nb_sold'].sum()

# Create a line plot of the number of avocados sold by date
# (nb_sold_by_date is a Series indexed by date, so no x or y arguments are needed)
nb_sold_by_date.plot(kind='line')

# Show the plot
plt.show()

In [None]:
# Exercise
# Avocado supply and demand
# Scatter plots are ideal for visualizing relationships between numerical variables. In this exercise, you'll compare the number of avocados sold to average price and see if they're at all related. If they're related, you may be able to use one number to predict the other.

# matplotlib.pyplot has been imported as plt, pandas has been imported as pd, and avocados is available.

# Instructions
# 100 XP
# Create a scatter plot with nb_sold on the x-axis and avg_price on the y-axis. Title it "Number of avocados sold vs. average price".
# Show the plot.

# Scatter plot of avg_price vs. nb_sold with title
avocados.plot(x='nb_sold', y='avg_price', kind='scatter', title='Number of avocados sold vs. average price')

# Show the plot
plt.show()

In [None]:
# Exercise
# Price of conventional vs. organic avocados
# Creating multiple plots for different subsets of data allows you to compare groups. In this exercise, you'll create multiple histograms to compare the prices of conventional and organic avocados.

# matplotlib.pyplot has been imported as plt and pandas has been imported as pd.

# Instructions 1/3
# 35 XP
# Subset avocados for the "conventional" type and create a histogram of the avg_price column.
# Create a histogram of avg_price for "organic" type avocados.
# Add a legend to your plot, with the names "conventional" and "organic".
# Show your plot.

# Histogram of conventional avg_price 
avocados[avocados['type'] == 'conventional']['avg_price'].hist()

# Histogram of organic avg_price
avocados[avocados['type'] == 'organic']['avg_price'].hist()

# Add a legend
plt.legend(['conventional', 'organic'])

# Show the plot
plt.show()


# Instructions 2/3
# 35 XP
# 2. Modify your code to adjust the transparency of both histograms to 0.5 to see how much overlap there is between the two distributions.

# Modify histogram transparency to 0.5 
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5)

# Modify histogram transparency to 0.5
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5)

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()



# Instructions 3/3
# 30 XP
# 3. Modify your code to use 20 bins in both histograms.

# Modify bins to 20
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)

# Modify bins to 20
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()

# 🧹 Handling Missing Values in pandas

Missing values are a common issue in real-world datasets. In this section, we’ll learn how pandas helps us **detect**, **count**, **visualize**, and **handle** missing data effectively.

---

## 🔍 What Is a Missing Value?

- Real datasets are rarely complete — sometimes data is missing due to errors, equipment failure, or incomplete records.
- In pandas, **missing values are represented as `NaN`** (Not a Number).

**Example:**
```python
# Sample DataFrame with missing values in 'weight_kg'
print(dogs)
```

| name    | breed       | color | height\_cm | weight\_kg | date\_of\_birth |
| ------- | ----------- | ----- | ---------- | ---------- | --------------- |
| Bella   | Labrador    | Brown | 56         | NaN        | 2013-07-01      |
| Charlie | Poodle      | Black | 43         | 24.0       | 2016-09-16      |
| Lucy    | Chow Chow   | Brown | 46         | 24.0       | 2014-08-25      |
| Cooper  | Schnauzer   | Gray  | 49         | NaN        | 2011-12-11      |
| Max     | Labrador    | Black | 59         | 29.0       | 2017-01-20      |
| Stella  | Chihuahua   | Tan   | 18         | 2.0        | 2015-04-20      |
| Bernie  | St. Bernard | White | 77         | 74.0       | 2018-02-27      |

📌 **Note:** Bella and Cooper are missing their weight entries.

---

## ✅ Detecting Missing Values with `.isna()`

```python
dogs.isna()
```

* This returns a **DataFrame of the same shape** with `True` where values are missing (`NaN`) and `False` otherwise.

**Output:**

| name  | breed | color | height\_cm | weight\_kg | date\_of\_birth |
| ----- | ----- | ----- | ---------- | ---------- | --------------- |
| False | False | False | False      | **True**   | False           |
| False | False | False | False      | False      | False           |
| False | False | False | False      | False      | False           |
| False | False | False | False      | **True**   | False           |
| False | False | False | False      | False      | False           |
| False | False | False | False      | False      | False           |
| False | False | False | False      | False      | False           |

---

## ❓ Are There Any Missing Values Per Column?

```python
dogs.isna().any()
```

* Checks **column-wise** if **any values are missing**.
* Useful for a quick overview.

**Output:**

```
name             False
breed            False
color            False
height_cm        False
weight_kg         True   ← missing data present here
date_of_birth    False
dtype: bool
```

---

## 🔢 How Many Values Are Missing?

```python
dogs.isna().sum()
```

* Sums `True` values (which are treated as 1) for each column.
* Tells you **how many values are missing**.

**Output:**

```
name             0
breed            0
color            0
height_cm        0
weight_kg        2   ← two values missing
date_of_birth    0
dtype: int64
```
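
The checks above can be tried on a small frame. This sketch rebuilds part of the `dogs` table shown earlier (the name `dogs_demo` is illustrative) and also uses `.isna().mean()`, which gives the share of missing values per column because `True` counts as 1:

```python
import numpy as np
import pandas as pd

# Rebuild a slice of the dogs table from above
dogs_demo = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [np.nan, 24.0, 24.0, np.nan, 29.0, 2.0, 74.0],
})

missing_counts = dogs_demo.isna().sum()   # number of NaNs per column
missing_share = dogs_demo.isna().mean()   # fraction of NaNs per column
print(missing_counts)
print(missing_share.round(2))
```

The share is often easier to interpret than the raw count when columns have many rows.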

---

## 📊 Visualizing Missing Data

```python
dogs.isna().sum().plot(kind="bar")
plt.show()
```

* A **bar plot** can help visualize where the missing data is.
* Great for spotting missing values across many columns.

📌 In this case, only `weight_kg` has missing values, so its bar is the only one with a nonzero height.

---

## ✂️ Removing Rows with Missing Data

```python
dogs.dropna()
```

* Removes **any row** that contains **at least one NaN**.
* Useful if missing data is rare.
* ❗ Use with caution — dropping rows can reduce dataset size drastically.
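
When only some columns matter for an analysis, `dropna` accepts a `subset` argument that limits the check to those columns. A small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two different columns
dogs_demo = pd.DataFrame({
    "name": ["Bella", "Charlie", "Cooper"],
    "color": ["Brown", None, "Gray"],
    "weight_kg": [np.nan, 24.0, 30.0],
})

all_complete = dogs_demo.dropna()                     # drops Bella and Charlie
has_weight = dogs_demo.dropna(subset=["weight_kg"])   # drops only Bella
print(len(dogs_demo), len(all_complete), len(has_weight))
```

Both calls return a new DataFrame and leave `dogs_demo` unchanged.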

---

## 🔄 Replacing Missing Values

```python
dogs.fillna(0)
```

* Replaces **all NaNs with the value provided** (here, 0).
* Useful when you want to preserve rows but make NaNs meaningful.
* Can also use mean, median, etc., for more sophisticated imputation.
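
As one possible form of imputation, the sketch below fills missing weights with the column median instead of 0. The data is a hypothetical subset of the `dogs` table, and passing a dictionary to `fillna` restricts the fill to the named column:

```python
import numpy as np
import pandas as pd

dogs_demo = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "weight_kg": [np.nan, 24.0, 24.0, np.nan, 29.0],
})

# median() skips NaN by default
median_weight = dogs_demo["weight_kg"].median()

# Passing a dict limits the fill to the named column
dogs_filled = dogs_demo.fillna({"weight_kg": median_weight})
print(dogs_filled)
```

Whether the median, the mean, or 0 is appropriate depends on what a missing value means in the dataset.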

---

## 🧠 Key Takeaways

* `NaN` stands for “Not a Number” and represents missing data in pandas.
* Use `.isna()` to detect missing data.
* Chain `.isna().sum()` or `.isna().any()` for quick summaries.
* Use `.dropna()` to remove rows with missing values.
* Use `.fillna()` to replace missing values.

---

## ✅ Summary

| Method             | Purpose                             |
| ------------------ | ----------------------------------- |
| `df.isna()`        | Check which values are missing      |
| `df.isna().any()`  | See if each column has any missing  |
| `df.isna().sum()`  | Count missing values per column     |
| `df.dropna()`      | Remove rows with any missing values |
| `df.fillna(value)` | Replace missing values with a value |

---



In [None]:
# Exercise
# Finding missing values
# Missing values are everywhere, and you don't want them interfering with your work. Some functions ignore missing data by default, but that's not always the behavior you might want. Some functions can't handle missing values at all, so these values need to be taken care of before you can use them. If you don't know where your missing values are, or if they exist, you could make mistakes in your analysis. In this exercise, you'll determine if there are missing values in the dataset, and if so, how many.

# pandas has been imported as pd and avocados_2016, a subset of avocados that contains only sales from 2016, is available.

# Instructions
# 100 XP
# Print a DataFrame that shows whether each value in avocados_2016 is missing or not.
# Print a summary that shows whether any value in each column is missing or not.
# Create a bar plot of the total number of missing values in each column.

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Check individual values for missing values
print(avocados_2016.isna())

# Check each column for missing values
print(avocados_2016.isna().any())

# Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind='bar')

# Show plot
plt.show()



In [None]:
# Exercise
# Removing missing values
# Now that you know there are some missing values in your DataFrame, you have a few options to deal with them. One way is to remove them from the dataset completely. In this exercise, you'll remove missing values by removing all rows that contain missing values.

# pandas has been imported as pd and avocados_2016 is available.

# Instructions
# 100 XP
# Remove the rows of avocados_2016 that contain missing values and store the remaining rows in avocados_complete.
# Verify that all missing values have been removed from avocados_complete. Calculate each column that has NAs and print.


# Remove rows with missing values
avocados_complete = avocados_2016.dropna()

# Check if any columns contain missing values
print(avocados_complete.isna().any())

# <script.py> output:
#     date               False
#     avg_price          False
#     total_sold         False
#     small_sold         False
#     large_sold         False
#     xl_sold            False
#     total_bags_sold    False
#     small_bags_sold    False
#     large_bags_sold    False
#     xl_bags_sold       False
#     dtype: bool

In [None]:
# Exercise
# Replacing missing values
# Another way of handling missing values is to replace them all with the same value. For numerical variables, one option is to replace values with 0, which you'll do here. However, when you replace missing values, you make assumptions about what a missing value means. In this case, you will assume that a missing number sold means that no sales for that avocado type were made that week.

# In this exercise, you'll see how replacing missing values can affect the distribution of a variable using histograms. You can plot histograms for multiple variables at a time as follows:

# dogs[["height_cm", "weight_kg"]].hist()
# pandas has been imported as pd and matplotlib.pyplot has been imported as plt. The avocados_2016 dataset is available.

# Instructions 1/2
# 50 XP
# 1. A list has been created, cols_with_missing, containing the names of columns with missing values: "small_sold", "large_sold", and "xl_sold".
# Create a histogram of those columns.
# Show the plot.

# List the columns with missing values
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]

# Create histograms showing the distributions cols_with_missing
avocados_2016[cols_with_missing].hist()

# Show the plot
plt.show()


# Instructions 2/2
# 50 XP
# 2
# Replace the missing values of avocados_2016 with 0s and store the result as avocados_filled.
# Create a histogram of the cols_with_missing columns of avocados_filled.

# From previous step
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]
avocados_2016[cols_with_missing].hist()
plt.show()

# Fill in missing values with 0
avocados_filled = avocados_2016.fillna(0)

# Create histograms of the filled columns
avocados_filled[cols_with_missing].hist()

# Show the plot
plt.show()

# 🐼 Creating DataFrames in pandas

Creating your own DataFrame from scratch is a fundamental step in data manipulation using pandas. This section covers two main methods:
- From a **list of dictionaries** (row-wise construction)
- From a **dictionary of lists** (column-wise construction)

Before we jump into that, let’s start with a quick refresher on Python dictionaries.

---

## 📚 1. Understanding Python Dictionaries

A **dictionary** in Python is a data structure that stores data as key-value pairs. It looks like this:

```python
my_dict = {
    "key1": value1,
    "key2": value2,
    "key3": value3
}
```

* **Keys** are like labels (e.g., `"title"`, `"author"`)
* **Values** are the actual data associated with each key.

Example:

```python
my_dict = {
    "title": "Charlotte's Web",
    "author": "E.B. White",
    "published": 1952
}

print(my_dict["title"])
```

📤 **Output:**

```
Charlotte's Web
```

👉 This is foundational because pandas DataFrames can be built using these key-value structures.

---

## 🧱 2. Creating a DataFrame from a List of Dictionaries (Row-wise)

In this method, we construct the DataFrame **row by row**. Each dictionary represents one row, and each key-value pair represents a column and its value.

### 🐶 Data Example:

```python
list_of_dicts = [
    {
        "name": "Ginger",
        "breed": "Dachshund",
        "height_cm": 22,
        "weight_kg": 10,
        "date_of_birth": "2019-03-14"
    },
    {
        "name": "Scout",
        "breed": "Dalmatian",
        "height_cm": 59,
        "weight_kg": 25,
        "date_of_birth": "2019-05-09"
    }
]
```

* Each item in the list is a dictionary representing one dog’s data.
* Keys: column names (`name`, `breed`, etc.)
* Values: individual dog's attributes

### 📥 Converting to a DataFrame:

```python
new_dogs = pd.DataFrame(list_of_dicts)
print(new_dogs)
```

📤 **Output:**

```
     name      breed  height_cm  weight_kg date_of_birth
0  Ginger  Dachshund         22         10    2019-03-14
1   Scout  Dalmatian         59         25    2019-05-09
```

✅ This method is intuitive when your data is naturally row-oriented — such as reading from JSON records.

---

## 🧱 3. Creating a DataFrame from a Dictionary of Lists (Column-wise)

In this method, we define the DataFrame **column by column**. Each key is a column name, and each value is a list of entries in that column.

### 📊 Data Example:

```python
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14", "2019-05-09"]
}
```

* Each list must be the **same length** — one value per row.
* The position in each list aligns across columns (e.g., first elements in all lists form the first row).
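
The same-length requirement is enforced by pandas. The sketch below uses a deliberately broken dictionary (hypothetical data) to show that the constructor raises a `ValueError` rather than padding the shorter list:

```python
import pandas as pd

# Deliberately broken input: "breed" has one entry fewer than "name"
bad_dict = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund"],
}

try:
    pd.DataFrame(bad_dict)
    raised = False
except ValueError as err:
    raised = True
    print("pandas refused the input:", err)
```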

### 📥 Converting to a DataFrame:

```python
new_dogs = pd.DataFrame(dict_of_lists)
print(new_dogs)
```

📤 **Output:**

```
     name      breed  height_cm  weight_kg date_of_birth
0  Ginger  Dachshund         22         10    2019-03-14
1   Scout  Dalmatian         59         25    2019-05-09
```

✅ This is often useful when collecting column-wise data (e.g., manual entry or from APIs).

---

## 📝 Summary

| Method               | Structure Type | Built From             | Example Object  |
| -------------------- | -------------- | ---------------------- | --------------- |
| List of dictionaries | Row-wise       | One dictionary per row | `list_of_dicts` |
| Dictionary of lists  | Column-wise    | One list per column    | `dict_of_lists` |

Both methods are valid — choose based on how your raw data is structured.
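
As a quick check that the two approaches give the same result, this sketch builds a shortened version of the dog data both ways and compares the frames with `DataFrame.equals`:

```python
import pandas as pd

list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59},
]
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
}

from_rows = pd.DataFrame(list_of_dicts)
from_columns = pd.DataFrame(dict_of_lists)

# equals() checks that both frames have the same shape, labels, and elements
print(from_rows.equals(from_columns))
```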

---




In [None]:
# Exercise
# List of dictionaries
# You recently got some new avocado data from 2019 that you'd like to put in a DataFrame using the list of dictionaries method. Remember that with this method, you go through the data row by row.

# |     date      | small_sold | large_sold |
# |---------------|------------|------------|
# | 2019-11-03    | 10376832   |  7835071   |
# | 2019-11-10    | 10717154   |  8561348   |

# pandas as pd is imported.

# Instructions
# 100 XP
# Create a list of dictionaries with the new data called avocados_list.
# Convert the list into a DataFrame called avocados_2019.
# Print your new DataFrame.


# Create a list of dictionaries with new data
avocados_list = [
    { 'date': '2019-11-03', 'small_sold': 10376832, 'large_sold': 7835071},
    {'date': '2019-11-10', 'small_sold': 10717154, 'large_sold': 8561348}
]

# Convert list into DataFrame
avocados_2019 = pd.DataFrame(avocados_list)

# Print the new DataFrame
print(avocados_2019)


# <script.py> output:
#              date  small_sold  large_sold
#     0  2019-11-03    10376832     7835071
#     1  2019-11-10    10717154     8561348

In [None]:
# Exercise
# Dictionary of lists
# Some more data just came in! This time, you'll use the dictionary of lists method, parsing the data column by column.

# |     date      | small_sold | large_sold |
# |---------------|------------|------------|
# | 2019-11-17    | 10859987   |  7674135   |
# | 2019-12-01    |  9291631   |  6238096   |

# pandas as pd is imported.

# Instructions
# 100 XP
# Create a dictionary of lists with the new data called avocados_dict.
# Convert the dictionary to a DataFrame called avocados_2019.
# Print your new DataFrame.


# Create a dictionary of lists with new data
avocados_dict = {
  "date": ['2019-11-17', '2019-12-01'],
  "small_sold": [10859987, 9291631],
  "large_sold": [7674135, 6238096]
}

# Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)

# Print the new DataFrame
print(avocados_2019)



# <script.py> output:
#              date  small_sold  large_sold
#     0  2019-11-17    10859987     7674135
#     1  2019-12-01     9291631     6238096

---

# 📊 Reading and Writing CSVs with pandas

> In this section, you'll learn how to load real-world tabular data from a CSV file into a pandas DataFrame, manipulate that data, and save your results back to a new CSV file. CSVs are one of the most common file formats in data analysis and work great with pandas.

---

## 🔹 What is a CSV File?

* **CSV** stands for **Comma-Separated Values**.
* It's a **plain text** format designed to store **tabular data** (rows and columns).
* Each line in a CSV represents a row, and values are separated by commas.
* The first line usually contains **column headers**.
* CSVs are:

  * **Lightweight**
  * **Human-readable**
  * **Compatible** with Excel, databases, Python, R, and many other tools.
* This makes CSV files an ideal format for **data exchange** between different platforms.

> 💡 Think of a CSV as a simpler Excel sheet saved as text.

---

## 📄 Example CSV File: `new_dogs.csv`

Here’s a sample file:

```
name,breed,height_cm,weight_kg,d_o_b
Ginger,Dachshund,22,10,2019-03-14
Scout,Dalmatian,59,25,2019-05-09
```

* Each line is a new dog's record.
* Columns represent: name, breed, height (in cm), weight (in kg), and date of birth.

> This file is ready to be loaded into a pandas DataFrame!

---

## 📥 Loading a CSV into a pandas DataFrame

```python
import pandas as pd
new_dogs = pd.read_csv("new_dogs.csv")
print(new_dogs)
```

### 🔍 Explanation:

* `import pandas as pd`: Loads the pandas library and gives it the nickname `pd` (a standard convention).
* `pd.read_csv(...)`: Reads the contents of the CSV file and loads it into a DataFrame.
* `"new_dogs.csv"`: The name of the file to read. It must be in the current working directory; otherwise, pass a full or relative path.
* `print(new_dogs)`: Displays the resulting DataFrame.

### ✅ Output:

```
     name      breed  height_cm  weight_kg       d_o_b
0  Ginger  Dachshund         22         10  2019-03-14
1   Scout  Dalmatian         59         25  2019-05-09
```

> The CSV data is now available in a structured format (DataFrame) for further analysis.
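
By default, a date column such as `d_o_b` is read as plain text. The sketch below holds the same CSV contents in memory with `io.StringIO` (so it runs without a file on disk) and uses the `parse_dates` argument of `pd.read_csv` to convert that column while loading:

```python
import io
import pandas as pd

# Same contents as new_dogs.csv, held in memory so the sketch runs without a file
csv_text = """name,breed,height_cm,weight_kg,d_o_b
Ginger,Dachshund,22,10,2019-03-14
Scout,Dalmatian,59,25,2019-05-09
"""

# parse_dates asks pandas to convert the listed columns to datetime
new_dogs_demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["d_o_b"])
print(new_dogs_demo.dtypes)
```

With a real file, the first argument would simply be the path, as in the example above.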

---

## 🧠 Manipulating the DataFrame: Add BMI Column

```python
new_dogs["bmi"] = new_dogs["weight_kg"] / (new_dogs["height_cm"] / 100) ** 2
print(new_dogs)
```

### 🔍 Explanation:

* `new_dogs["bmi"] = ...` creates a new column called `"bmi"` in the DataFrame.
* `weight_kg / (height_cm / 100) ** 2` is the formula for **Body Mass Index (BMI)**:

  * Weight in kilograms divided by height in meters squared.
* We divide `height_cm` by 100 to convert to meters.

### ✅ Output:

```
     name      breed  height_cm  weight_kg       d_o_b         bmi
0  Ginger  Dachshund         22         10  2019-03-14  206.611570
1   Scout  Dalmatian         59         25  2019-05-09   71.818443
```

> 🧪 Insight: Ginger’s BMI is quite high due to small height and relatively high weight.

---

## 💾 Saving the DataFrame to a New CSV

```python
new_dogs.to_csv("new_dogs_with_bmi.csv")
```

### 🔍 Explanation:

* `to_csv(...)`: Writes the DataFrame to a new CSV file.
* `"new_dogs_with_bmi.csv"`: The name of the new output file that will contain the BMI column.

### ✅ Result:

The file `new_dogs_with_bmi.csv` is created, and it looks like this (the unnamed first column is the DataFrame's row index, and the `bmi` values are shown rounded):

```
,name,breed,height_cm,weight_kg,d_o_b,bmi
0,Ginger,Dachshund,22,10,2019-03-14,206.611570
1,Scout,Dalmatian,59,25,2019-05-09,71.818443
```

> 🗃️ Now this file can be shared with others, emailed, or opened in Excel — including the new BMI column.
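
The leading unnamed column in the saved file is the row index. If it is not wanted, `to_csv` accepts `index=False`. The sketch below uses a shortened, illustrative frame and calls `to_csv` without a path, which returns the CSV text instead of writing a file:

```python
import pandas as pd

new_dogs_demo = pd.DataFrame({
    "name": ["Ginger", "Scout"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
})

# With no path, to_csv returns the CSV text instead of writing a file
with_index = new_dogs_demo.to_csv()
without_index = new_dogs_demo.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Dropping the index also avoids an extra `Unnamed: 0` column when the file is read back with `pd.read_csv`.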

---

## ✅ Recap: Key Techniques

| Task                        | pandas Method             | Description                                     |
| --------------------------- | ------------------------- | ----------------------------------------------- |
| **Read CSV into DataFrame** | `pd.read_csv()`           | Loads tabular data into a structured DataFrame. |
| **Manipulate Data**         | `DataFrame[column] = ...` | Add new columns using math/logic operations.    |
| **Save DataFrame to CSV**   | `DataFrame.to_csv()`      | Saves updated DataFrame to a new CSV file.      |

---

## 🎯 Summary

* CSVs are universal files for storing spreadsheet-like data.
* pandas makes it easy to load and manipulate CSVs.
* You can calculate new fields like BMI using basic math on columns.
* After modifications, you can export the updated data back to a new CSV.

---

> 💬 **Pro Tip**: Always check the new file after saving to confirm that it includes the changes you expect.

---


In [None]:
# Exercise
# CSV to DataFrame
# You work for an airline, and your manager has asked you to do a competitive analysis and see how often passengers flying on other airlines are involuntarily bumped from their flights. You got a CSV file (airline_bumping.csv) from the Department of Transportation containing data on passengers that were involuntarily denied boarding in 2016 and 2017, but it doesn't have the exact numbers you want. In order to figure this out, you'll need to get the CSV into a pandas DataFrame and do some manipulation!

# pandas is imported for you as pd. "airline_bumping.csv" is in your working directory.

# Instructions 1/4
# 25 XP
# 1. Read the CSV file "airline_bumping.csv" and store it as a DataFrame called airline_bumping.
# Print the first few rows of airline_bumping.


# Read CSV as DataFrame called airline_bumping
airline_bumping = pd.read_csv('airline_bumping.csv')

# Take a look at the DataFrame
print(airline_bumping.head())


# <script.py> output:
#                  airline  year  nb_bumped  total_passengers
#     0    DELTA AIR LINES  2017        679          99796155
#     1     VIRGIN AMERICA  2017        165           6090029
#     2    JETBLUE AIRWAYS  2017       1475          27255038
#     3    UNITED AIRLINES  2017       2067          70030765
#     4  HAWAIIAN AIRLINES  2017         92           8422734



# Instructions 2/4
# 25 XP
# 2. For each airline group, select the nb_bumped and total_passengers columns, and calculate the sum (for both years). Store this as airline_totals.


# From previous step
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())

# For each airline, select nb_bumped and total_passengers and sum
airline_totals = airline_bumping.groupby('airline')[['nb_bumped','total_passengers']].sum()



# Instructions 3/4
# 25 XP
# 3. Create a new column of airline_totals called bumps_per_10k, which is the number of passengers bumped per 10,000 passengers in 2016 and 2017.

# From previous steps
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

# Create new col, bumps_per_10k: no. of bumps per 10k passengers for each airline
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

# <script.py> output:
#                  airline  year  nb_bumped  total_passengers
#     0    DELTA AIR LINES  2017        679          99796155
#     1     VIRGIN AMERICA  2017        165           6090029
#     2    JETBLUE AIRWAYS  2017       1475          27255038
#     3    UNITED AIRLINES  2017       2067          70030765
#     4  HAWAIIAN AIRLINES  2017         92           8422734
#                          nb_bumped  total_passengers  bumps_per_10k
#     airline                                                        
#     ALASKA AIRLINES           1392          36543121       0.380920
#     AMERICAN AIRLINES        11115         197365225       0.563169
#     DELTA AIR LINES           1591         197033215       0.080748
#     EXPRESSJET AIRLINES       3326          27858678       1.193883
#     FRONTIER AIRLINES         1228          22954995       0.534960
#     HAWAIIAN AIRLINES          122          16577572       0.073593
#     JETBLUE AIRWAYS           3615          53245866       0.678926
#     SKYWEST AIRLINES          3094          47091737       0.657015
#     SOUTHWEST AIRLINES       18585         228142036       0.814624
#     SPIRIT AIRLINES           2920          32304571       0.903897
#     UNITED AIRLINES           4941         134468897       0.367446
#     VIRGIN AMERICA             242          12017967       0.201365



# Instructions 4/4
# 25 XP
# 4. Print airline_totals to see the results of your manipulations.

# From previous steps
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

# Print airline_totals
print(airline_totals)


# Exercise
# DataFrame to CSV
# You're almost there! To make things easier to read, you'll need to sort the data and export it to CSV so that your colleagues can read it.

# pandas as pd has been imported for you.

# Instructions
# 100 XP
# Sort airline_totals by the values of bumps_per_10k from highest to lowest, storing as airline_totals_sorted.
# Print your sorted DataFrame.
# Save the sorted DataFrame as a CSV called "airline_totals_sorted.csv".

# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values('bumps_per_10k', ascending=False)

# Print airline_totals_sorted
print(airline_totals_sorted)

# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv("airline_totals_sorted.csv")


# <script.py> output:
#                                                         airline  nb_bumped  total_passengers  bumps_per_10k
#     airline                                                                                                
#     EXPRESSJET AIRLINES  EXPRESSJET AIRLINESEXPRESSJET AIRLINES       3326          27858678       1.193883
#     SPIRIT AIRLINES              SPIRIT AIRLINESSPIRIT AIRLINES       2920          32304571       0.903897
#     SOUTHWEST AIRLINES     SOUTHWEST AIRLINESSOUTHWEST AIRLINES      18585         228142036       0.814624
#     JETBLUE AIRWAYS              JETBLUE AIRWAYSJETBLUE AIRWAYS       3615          53245866       0.678926
#     SKYWEST AIRLINES           SKYWEST AIRLINESSKYWEST AIRLINES       3094          47091737       0.657015
#     AMERICAN AIRLINES        AMERICAN AIRLINESAMERICAN AIRLINES      11115         197365225       0.563169
#     FRONTIER AIRLINES        FRONTIER AIRLINESFRONTIER AIRLINES       1228          22954995       0.534960
#     ALASKA AIRLINES              ALASKA AIRLINESALASKA AIRLINES       1392          36543121       0.380920
#     UNITED AIRLINES              UNITED AIRLINESUNITED AIRLINES       4941         134468897       0.367446
#     VIRGIN AMERICA                 VIRGIN AMERICAVIRGIN AMERICA        242          12017967       0.201365
#     DELTA AIR LINES              DELTA AIR LINESDELTA AIR LINES       1591         197033215       0.080748
#     HAWAIIAN AIRLINES        HAWAIIAN AIRLINESHAWAIIAN AIRLINES        122          16577572       0.073593



# 📘 Data Manipulation with pandas — Wrap-up & Recap

---

## ✅ Congratulations!

> This course covered the fundamental building blocks to load, transform, and analyze tabular data using the powerful `pandas` library in Python.

---

## 📚 Chapter-wise Recap

---

### 📖 Chapter 1: Subsetting and Sorting

#### 🔹 Key Concepts:

* Accessing specific rows and columns in a DataFrame.
* Sorting values by specific columns.
* Creating new columns using vectorized operations.

#### 💡 Why it matters:

Subsetting and sorting are essential to **isolate relevant data**, **clean it**, or **prepare it for analysis**.
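
As a quick refresher, the three ideas above might look like this in practice. This is a minimal sketch: the `dogs` DataFrame and its values are invented for illustration and are not the course dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador"],
    "height_cm": [56, 43, 46, 59],
    "weight_kg": [25, 23, 22, 29],
})

# Subset rows with a boolean condition, then keep only two columns
labradors = dogs[dogs["breed"] == "Labrador"][["name", "weight_kg"]]

# Sort by a column, heaviest dog first
by_weight = dogs.sort_values("weight_kg", ascending=False)

# Create a new column with a vectorized operation (no explicit loop)
dogs["height_m"] = dogs["height_cm"] / 100

print(labradors)
print(by_weight.head())
```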

---

### 📖 Chapter 2: Aggregating and Grouping

#### 🔹 Key Concepts:

* Using `.mean()`, `.min()`, `.max()`, `.count()`, etc., to get **summary statistics**.
* Using `.groupby()` to **aggregate data across categories**.

#### 💡 Why it matters:

Aggregation lets you **summarize** large datasets into meaningful insights (e.g., averages by category, totals by year, etc.).
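
A small sketch of both ideas, assuming a made-up `sales` table (the store names and numbers are illustrative only):

```python
import pandas as pd

# Hypothetical toy data, invented for this example
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "weekly_sales": [100.0, 150.0, 80.0, 120.0, 100.0],
})

# Summary statistic over the whole column
overall_mean = sales["weekly_sales"].mean()

# Several summary statistics per group
by_store = sales.groupby("store")["weekly_sales"].agg(["mean", "max", "count"])

print(overall_mean)
print(by_store)
```

Passing a list to `.agg()` is one convenient way to compute several statistics in a single call; calling `.mean()` or `.max()` directly on the grouped column works just as well when only one is needed.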

---

### 📖 Chapter 3: Indexing and Slicing

#### 🔹 Key Concepts:

* Setting the index with `.set_index()` to **optimize lookup and slicing**.
* Using `.loc[]` and `.iloc[]` for **label-based** and **position-based** slicing.
* Multi-level indexing and slicing by tuples.

#### 💡 Why it matters:

Indexing helps you write **cleaner, faster, and more readable** data access code. It simplifies filtering and enables powerful subsetting.
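
The sketch below revisits these tools on a tiny, invented `temps` table (the cities and temperatures are placeholders, not course data). Note that label-based slicing with `.loc[]` includes both endpoints, while `.iloc[]` excludes the end position, and that slicing by label expects a sorted index.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
temps = pd.DataFrame({
    "country": ["Brazil", "Brazil", "India", "India"],
    "city": ["Rio", "Sao Paulo", "Delhi", "Mumbai"],
    "avg_temp_c": [25.0, 20.0, 26.0, 27.0],
})

# Single-level index: sort it before slicing by label
by_city = temps.set_index("city").sort_index()
label_slice = by_city.loc["Delhi":"Rio"]   # label-based, end label included
position_slice = by_city.iloc[0:2]         # position-based, end position excluded

# Multi-level index: select and slice with tuples
multi = temps.set_index(["country", "city"]).sort_index()
one_row = multi.loc[("India", "Delhi")]
tuple_slice = multi.loc[("Brazil", "Rio"):("India", "Delhi")]

print(label_slice)
print(tuple_slice)
```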

---

### 📖 Chapter 4: Visualization & File I/O

#### 🔹 Key Concepts:

* Creating plots directly from a DataFrame using `.plot()`.
* Reading and writing CSV files using `pd.read_csv()` and `.to_csv()`.

#### 💡 Why it matters:

* **Visualizations** help explore and communicate insights from data.
* **CSV I/O** is crucial for working with real-world datasets — it's how data comes in and out of pandas.
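
For the file I/O half, a CSV round trip might look like the sketch below. The `totals` DataFrame, the airline names, and the file name are all hypothetical, and a temporary directory is used so nothing is left on disk. Plotting is left out here so the snippet runs without a display.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, invented for this example
totals = pd.DataFrame(
    {"nb_bumped": [10, 20], "total_passengers": [1000, 4000]},
    index=pd.Index(["AIR ONE", "AIR TWO"], name="airline"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "totals.csv")

    # Write the DataFrame, including its index, to CSV
    totals.to_csv(path)

    # Read it back, restoring the airline column as the index
    restored = pd.read_csv(path, index_col="airline")

print(restored)
```

Without `index_col="airline"`, the index written by `.to_csv()` would come back as an ordinary column.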

---

### 🔗 Joining DataFrames

* Real-world datasets are often split across multiple tables.
* **Merging**, **joining**, and **concatenating** DataFrames is essential for **relational-style analysis**.
* Pandas supports SQL-style joins using `.merge()` and `.join()`.
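
As a small preview, and assuming two invented tables `dogs` and `owners` that share an `owner_id` key, the basic operations could be sketched as:

```python
import pandas as pd

# Hypothetical toy tables, invented for this example
dogs = pd.DataFrame({"name": ["Bella", "Charlie", "Lucy"], "owner_id": [1, 2, 3]})
owners = pd.DataFrame({"owner_id": [1, 2], "owner": ["Ana", "Ben"]})

# Inner join (the default): keep only keys present in both tables
inner = dogs.merge(owners, on="owner_id")

# Left join: keep every dog, with NaN where no owner matches
left = dogs.merge(owners, on="owner_id", how="left")

# Concatenate: stack rows from tables with the same columns
stacked = pd.concat([dogs, dogs], ignore_index=True)

print(inner)
print(left)
```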

---

### 📥 Streamlined Data Ingestion

* Beyond CSVs, pandas can load data from:

  * Excel files (`.xlsx`)
  * SQL databases
  * JSON, HTML, Parquet, and many other formats
* Tools like `read_sql()` and `read_excel()` let you **connect to more complex data sources**.
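
To give a flavor of loading from a database, here is a minimal sketch using an in-memory SQLite database from Python's standard library. The `dogs` table and its rows are made up for the example; a real project would connect to an existing database instead.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database (hypothetical table and rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (name TEXT, weight_kg REAL)")
conn.executemany(
    "INSERT INTO dogs VALUES (?, ?)",
    [("Bella", 25.0), ("Charlie", 23.0)],
)
conn.commit()

# Run a SQL query and receive the result as a DataFrame
heavy = pd.read_sql("SELECT name, weight_kg FROM dogs WHERE weight_kg > 24", conn)
conn.close()

print(heavy)
```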

---

### 📊 Exploratory Data Analysis (EDA)

* With more advanced pandas skills, you can:

  * Quickly spot trends, outliers, and patterns
  * Perform feature engineering
  * Clean and prepare data pipelines
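
One common way to spot outliers is the 1.5 × IQR rule. The sketch below applies it to an invented `weights` Series; the rule and its threshold are a convention chosen for this example, not something prescribed by the course.

```python
import pandas as pd

# Hypothetical toy data with one suspicious value
weights = pd.Series([22, 23, 24, 25, 26, 90], name="weight_kg")

# Quick numeric overview: count, mean, std, min, quartiles, max
summary = weights.describe()

# Flag values outside 1.5 * IQR of the quartiles
q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]

print(summary)
print(outliers)
```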

---

## 🎓 Final Words

Congratulations again! You’ve now built a strong base in:

* 🧮 Tabular data manipulation
* 📈 Basic visualizations
* 📤 Reading and saving datasets

This foundation is **essential for data analysis, machine learning, and real-world data work**. Keep practicing, and explore more advanced pandas features as you go!

---

> 🧠 *Pro Tip:* The more you work with real data, the more fluent you'll become in pandas.

> 🚀 Ready to level up? Explore courses like:
>
> * 📌 *Joining Data with pandas*
> * 📌 *Streamlined Data Ingestion with pandas*
> * 📌 *Analyzing Police Activity with pandas*
> * 📌 *Analyzing Marketing Campaigns with pandas*

---

### ✅ End of Course Notes


---
