In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("w06Alab.ipynb")

---

<h3><center>E7 -  Introduction to Programming for Scientists and Engineers</center></h3>

<h2><center>Lab session: Week 6-A <br></center></h2>

<h1><center>Files<br></center></h1>

---

In [None]:
from resources.hashutils import *
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import csv

# Part 1: The [`pathlib`](https://docs.python.org/3/library/pathlib.html) module

## Part 1.1

Use the `pathlib` module to create the folder structure shown below. 

```
cwd
├── f1.txt
├── d1
│   ├── f1.csv
│   └── f2.csv
└── d2
```

Here "cwd" refers to the current working directory, returned by `Path.cwd()`. `d1` and `d2` are directories. `f1.txt`, `f1.csv`, and `f2.csv` are empty files. 

**Hints**:
+ Create a directory with [`Path.mkdir()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir)
+ Create a file with [`Path.touch()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.touch)
+ Use [`Path.exists()`](https://docs.python.org/3/library/pathlib.html#pathlib.Path.exists) to check that the path does not already exist before calling `mkdir()` or `touch()`.
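
The three hints above can be combined as in the following minimal sketch. It is only an illustration: it runs inside a temporary directory, and the names `demo_dir` and `notes.txt` are hypothetical and are not part of the folder structure you are asked to create.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing in the lab folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    demo_dir = base / "demo_dir"        # hypothetical directory name
    if not demo_dir.exists():           # only create it if it is not there yet
        demo_dir.mkdir()

    demo_file = demo_dir / "notes.txt"  # hypothetical file name
    if not demo_file.exists():
        demo_file.touch()               # creates an empty file

    created_dir = demo_dir.is_dir()
    created_file = demo_file.is_file()
    file_size = demo_file.stat().st_size  # size in bytes of the new file

print(created_dir, created_file, file_size)
```

The `/` operator joins path components, so the same pattern works for nested paths.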

In [None]:
...

In [None]:
grader.check("p1p1")

## Part 1.2

Write a function called `count_file_types` that takes a folder (a `Path` object) as input and returns a dictionary with the number of files of each type in the folder. For example, if the folder contains two txt files and three pdf files, then the function should return the following dictionary:
```python
{'.txt':2, '.pdf':3}
```
**Hints**:
+ Use `for p in directory.iterdir()` to iterate through the items in the folder. 
+ Use `p.is_file()` to check that an item is a file (and not a subdirectory).
+ Get a file's extension with `p.suffix`.
+ Check whether a string `a` is a key in a dictionary `A` with `a in A` (equivalent to `a in A.keys()`).
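
As a rough illustration of the last two hints, the sketch below counts extensions in a made-up list of file names. The names are hypothetical and do not need to exist on disk, because `.suffix` only inspects the name. The membership test `ext in counts` is a shorter equivalent of `ext in counts.keys()`.

```python
from pathlib import Path

# Hypothetical file names, used only to demonstrate .suffix and dictionary counting.
names = ["a.txt", "b.pdf", "c.txt", "archive.tar.gz", "README"]

counts = {}
for name in names:
    ext = Path(name).suffix   # only the last extension; '' if there is none
    if ext in counts:         # key already seen: increment its count
        counts[ext] += 1
    else:                     # first time we see this extension
        counts[ext] = 1

print(counts)
```

Note that `.suffix` returns only the final extension (`'.gz'` for `archive.tar.gz`) and an empty string for a name with no extension.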

In [None]:
def count_file_types(directory):
    ...

In [None]:
# Test your code on the resources folder for this lab.
# This folder contains 1 .py file, 2 .csv files, and 3 .png files.
count_file_types(Path.cwd()/'resources')

In [None]:
grader.check("p1p2")

# Part 2: Reading and writing files with NumPy

In this part we will load and perform a simple analysis of air quality data consisting of measurements of two sizes of inhalable pollutants: PM2.5 and PM10. PM2.5 are fine particles that travel deep into the lungs and bloodstream, and can increase the risk for heart and lung disease. PM10 are coarse dust particles that may cause upper respiratory irritation. 

## Part 2.1: Read a dataset using NumPy

The data is contained in a file called `air_quality_data.csv` in the `resources` folder.  Here is a time series plot of the data contained in the file. 

<center><img src="resources/aqi.png" width="500" /></center>

And here is a snapshot of the first few rows of the file itself. 

<center><img src="resources/aqi_table.png" width="220" /></center>


Notice that the first row is a header. Your first task is to load the data (ignoring the header) into a single two-dimensional NumPy array called `air_data`. Use `np.loadtxt`'s input arguments to specify the comma (`,`) as the delimiter, and to skip the header row. Please consult the documentation of [`np.loadtxt`](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html) to see how to do this. 
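
A minimal sketch of the two relevant arguments, `delimiter` and `skiprows`, is shown below. It reads from an in-memory string rather than from the lab's data file, and the column names and values are made up for illustration.

```python
import io
import numpy as np

# Hypothetical two-column CSV text with one header row (not the lab's data).
text = "colA,colB\n1.0,2.0\n3.0,4.0\n5.0,6.0\n"

# np.loadtxt accepts a file name, a Path, or a file-like object such as StringIO.
demo = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)

print(demo.shape)
print(demo)
```

The result is a 2D array with one row per data line and one column per comma-separated field.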

In [None]:
datafile = ...
air_data = ...

In [None]:
grader.check("p2p1")

## Part 2.2

Save each column of `air_data` to its own NumPy binary file in your current working directory. Name the files `PM2p5.npy` and `PM10.npy` for PM2.5 and PM10 respectively. 
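
One way to write and read back a `.npy` file is with `np.save` and `np.load`, as in the sketch below. It assumes a small made-up array and writes to a temporary directory under the hypothetical file name `first_column.npy`, so it does not create the files requested above.

```python
import tempfile
from pathlib import Path
import numpy as np

demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # made-up values

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "first_column.npy"   # hypothetical file name
    np.save(target, demo[:, 0])               # write one column in binary .npy format
    restored = np.load(target)                # read it back to confirm the round trip

print(restored)
```

Loading the file back is a simple way to confirm that the saved array has the expected shape and values.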

In [None]:
...

In [None]:
grader.check("p2p2")

## Part 2.3

Compute the mean of the PM2.5 values and save it to a variable called `PM2p5_mean`.

Compute the mean of the PM10 values and save it to a variable called `PM10_mean`.


In [None]:
...

In [None]:
grader.check("p2p3")

## Part 2.4 (done already)

The next cell will create a scatter plot of PM2.5 and PM10 measurements. The mean value is represented as a red dot.  

In [None]:
fig, ax = plt.subplots(figsize=(5,4))
ax.scatter(air_data[:,0],air_data[:,1],marker='.',label='measurements')
ax.plot(PM2p5_mean,PM10_mean,'ro',label='sample mean')
ax.set_xlabel('PM2.5',fontsize=14)
ax.set_ylabel('PM10',fontsize=14)
ax.legend()

## Part 2.5

The previous plot shows that PM2.5 and PM10 tend to increase and decrease together. This tendency can be quantified with the so-called "correlation coefficient" $\rho$ between the two quantities. The correlation coefficient varies between -1 and 1. A positive $\rho$ indicates an "increasing" relationship (as in the figure). A negative $\rho$ indicates a "decreasing" relationship where one quantity tends to decrease as the other increases. We will not cover the mathematical details of the correlation coefficient in this course.

Compute the correlation between PM2.5 and PM10 and save it to a variable called `rho`. This can be done with NumPy's [`corrcoef`](https://numpy.org/doc/2.3/reference/generated/numpy.corrcoef.html) function. 

Hint: When you provide two arrays to `corrcoef`, it returns a symmetric 2×2 matrix (i.e. a 2D NumPy array). The correlation between PM2.5 and PM10 is the value in position [0,1] or [1,0].
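
The shape of the output described in the hint can be seen in the sketch below, which uses two short made-up arrays `u` and `v` (not the air quality data). Here `v` falls as `u` rises, so the off-diagonal entry is negative.

```python
import numpy as np

# Made-up measurements: v decreases as u increases, though not perfectly linearly.
u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v = np.array([10.0, 8.5, 7.0, 4.0, 1.0])

R = np.corrcoef(u, v)   # 2x2 matrix: diagonal entries are 1, off-diagonals are equal
r_uv = R[0, 1]          # correlation between u and v (same value as R[1, 0])

print(R.shape)
print(r_uv)
```

The diagonal entries are the correlation of each array with itself, which is why only the off-diagonal value is of interest.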

In [None]:
...
rho = ...

In [None]:
grader.check("p2p5")

## Part 2.6 (done already)

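The "linear regression line" is the straight line that best fits the data (in the sense of least squared error). It is determined by these two properties:
+ The linear regression line goes through the mean. 
+ The slope of the linear regression line is the correlation coefficient multiplied by the ratio of the standard deviations, $\rho \, \sigma_{PM10} / \sigma_{PM2.5}$. 

Having computed the means and the correlation coefficient, you have almost everything you need to draw the linear regression line; the only missing ingredients are the two standard deviations, which the cell below computes with `np.std`. We provide the plot for you below, and we will cover plotting techniques in Python next week. 

In [None]:
x = np.linspace(0,250)
# Least-squares slope: correlation coefficient scaled by the ratio of standard deviations.
slope = rho*np.std(air_data[:,1])/np.std(air_data[:,0])
linreg = (x-PM2p5_mean)*slope + PM10_mean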

fig, ax = plt.subplots(figsize=(5,4))
ax.scatter(air_data[:,0],air_data[:,1],marker='.',label='measurements')
ax.plot(PM2p5_mean,PM10_mean,'ro',label='sample mean')
ax.plot(x,linreg,'m',label='linear regression model')
ax.set_xlabel('PM2.5',fontsize=14)
ax.set_ylabel('PM10',fontsize=14)
ax.legend()

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Make sure you submit the .zip file to Gradescope.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)