# Lab 2.1 - Weather Data Around Winona

In this lab, we will download and combine a decade's worth of weather data from NOAA, focusing on weather stations within 500 miles of Winona.

Here is the outline of the basic process.

1. Install and investigate useful packages.
2. Find all weather stations in proximity to Winona.
3. Use a single station to prototype our tools.
4. Automate the process of downloading and uncompressing data from all stations of interest.
5. Output the results to a CSV file.

## Problem 1 - Install and investigate useful tools.

First, you should install and investigate the following tools.

1. **`wget`** is a tool for programmatically downloading data files from the web on the command line.  There is a Python wrapper to this tool that you can install with `pip` as shown below.
2. **`geopy`** is a package that, among other things, implements a function for computing distances between two lat-long pairs. Again, install this package with `pip` as shown below.
3. **`gzip`** is part of the standard Python library and reads and writes `.gz` compressed files, so no installation is needed.

In [15]:
%pip install wget

Note: you may need to restart the kernel to use updated packages.


In [16]:
%pip install geopy

Note: you may need to restart the kernel to use updated packages.


#### Task 1.1 - Investigate using `wget` to download a file.

Read the help/documentation on `wget` to figure out how to download the following data file (a sample data file from STAT 210) into the `./data` sub-folder.

[https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv](https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv)
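
Note that the download needs the `./data` folder to exist first. The following is a minimal sketch of a download helper that creates the folder when it is missing. It is not the lab's required approach: it uses the standard library's `urllib.request.urlretrieve` in place of `wget`, and the helper name `download_into` is hypothetical.

```python
import os
import urllib.request


def download_into(url, folder="./data"):
    """Download `url` into `folder` and return the local file path."""
    # Create the target folder if it is missing (no error if it exists).
    os.makedirs(folder, exist_ok=True)
    # Use the last URL segment as the local file name.
    file_name = url.rstrip("/").split("/")[-1]
    local_path = os.path.join(folder, file_name)
    # urlretrieve copies the remote resource to local_path.
    urllib.request.urlretrieve(url, local_path)
    return local_path
```

Calling `download_into(url)` with the URL above stored in `url` should save the file as `./data/sars1.csv`. Unlike the `wget` package, this helper overwrites an existing file of the same name.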

In [1]:
# Your code here
import wget

In [18]:
?wget

Type:        module
String form: <module 'wget' from 'C:\\Users\\hp6265bz\\AppData\\Local\\anaconda3\\envs\\polars\\Lib\\site-packages\\wget.py'>
File:        c:\users\hp6265bz\appdata\local\anaconda3\envs\polars\lib\site-packages\wget.py
Docstring:
Download utility as an easy way to get file from the net
 
  python -m wget <URL>
  python wget.py <URL>

Downloads: http://pypi.python.org/pypi/wget/
Development: http://bitbucket.org/techtonik/python-wget/

wget.py is not option compatible with Unix wget utility,
to make command line interface intuitive for new people.

Public domain by anatoly techtonik <techtonik@gmail.com>
Also available under the terms of MIT license
Copyright (c) 2010-2015 anatoly techtonik

In [2]:
url = "https://github.com/yardsale8/STAT_210/raw/refs/heads/main/data/sars1.csv"
output = "./data/sars1.csv"
file_name = wget.download(url, output)

#### Task 1.2 - Investigate using `geopy.distance.distance` to compute a distance in miles.

1. Import the `distance` function from the `geopy.distance` submodule.
2. Use Wikipedia to find the lat-long coordinates of Winona and Rochester MN.
3. Use `distance` to compute the distance between Winona and Rochester.
4. Use some other source (e.g., Google Maps) to check the answer.

In [3]:
# Your code here
from geopy.distance import distance

In [21]:
?distance

Init signature: distance(*args, **kwargs)
Docstring:
Calculate the geodesic distance between points.

Set which ellipsoidal model of the earth to use by specifying an
``ellipsoid`` keyword argument. The default is 'WGS-84', which is the
most globally accurate model.  If ``ellipsoid`` is a string, it is
looked up in the `ELLIPSOIDS` dictionary to obtain the major and minor
semiaxes and the flattening. Otherwise, it should be a tuple with those
values.  See the comments above the `ELLIPSOIDS` dictionary for
more information.

Example::

    >>> from geopy.distance import geodesic
    >>> newport_ri = (41.49008, -71.312796)
    >>> cleveland_oh = (41.499498, -81.695391)
    >>> print(geodesic(newport_ri, cleveland_oh).miles)
    538.390445368
Init docstring:
There are 3 ways to create a distance:

- From kilometers::

    >>> from geopy.distance im

In [4]:
winona = (44.050556, -91.668333)
rochester = (44.023333, -92.461389)
distance(winona, rochester).miles

39.54418575388878
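
As one more way to check the result (step 4), the great-circle distance can be computed by hand with the haversine formula. This is a sketch that assumes a spherical Earth with a mean radius of 3958.8 miles, so it will differ slightly from the ellipsoidal (WGS-84) result that `geopy` reports. The function name `haversine_miles` is made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # assumed mean radius of a spherical Earth


def haversine_miles(p1, p2):
    """Great-circle distance in miles between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula for the central angle between the two points.
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


winona = (44.050556, -91.668333)
rochester = (44.023333, -92.461389)
print(haversine_miles(winona, rochester))
```

The printed value should agree with the geodesic distance above to within roughly half a percent. A much larger gap usually means the latitude and longitude were swapped.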

#### Task 1.3 - Investigate `gzip`

The yearly NOAA data is compressed as `.gz` files, which need to be uncompressed using `gzip`.  Explore the `gzip` module as follows.

1. Explore the documentation/help for the `gzip` module.
2. Use `wget` to download the following link into the `./data` folder.
3. Use `gzip` to open the compressed file and read its lines into a list.
4. Inspect the data in your list, which should be of type `bytes`.  Use a comprehension with the expression `l.decode('utf-8')` to convert this to a list of strings.
5. Write the uncompressed lines to an output file using `with open(path, 'w') as out` and the `writelines` method of `out`.

**Link.** [https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz](https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz)
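
The read, decode, and write steps can be tried without any download. The sketch below builds a tiny `.gz` file in a temporary folder, then reads, decodes, and writes it back out. The file names are placeholders, and the two sample rows follow the layout of the 1750 data.

```python
import gzip
import os
import tempfile

# Two sample rows in the same layout as the NOAA yearly files.
rows = [b"ASN00002061,17500201,PRCP,56,,,a,\n",
        b"ASN00003014,17500201,PRCP,0,,,a,\n"]

folder = tempfile.mkdtemp()
gz_path = os.path.join(folder, "sample.csv.gz")
csv_path = os.path.join(folder, "sample.csv")

# Write a small compressed file to practice on.
with gzip.open(gz_path, "wb") as f:
    f.writelines(rows)

# Reading in binary mode ("rb") yields a list of bytes objects.
with gzip.open(gz_path, "rb") as f:
    raw_lines = f.readlines()

# Decode each bytes object into a str.
text_lines = [l.decode("utf-8") for l in raw_lines]

# Write the uncompressed text to a plain CSV file.
with open(csv_path, "w", encoding="utf-8") as out:
    out.writelines(text_lines)

print(type(raw_lines[0]), type(text_lines[0]))
```

Opening the compressed file with mode `"rt"` instead of `"rb"` would return strings directly and make the decode step unnecessary.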

In [5]:
# Your code here
import gzip

In [6]:
?gzip

Type:        module
String form: <module 'gzip' from 'C:\\Users\\hp6265bz\\AppData\\Local\\anaconda3\\envs\\polars\\Lib\\gzip.py'>
File:        c:\users\hp6265bz\appdata\local\anaconda3\envs\polars\lib\gzip.py
Docstring:
Functions that read and write gzipped files.

The user of the file doesn't have to worry about the compression,
but random access is not allowed.

In [7]:
url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz"
output = "./data/1750.csv.gz"
# Pass the full output path so the saved file name is explicit.
file_name = wget.download(url, output)

In [8]:
with gzip.open("./data/1750.csv.gz", "r") as f:
    lines = f.readlines()
lines[:10]

[b'ASN00002061,17500201,PRCP,56,,,a,\n',
 b'ASN00003014,17500201,PRCP,0,,,a,\n',
 b'ASN00003059,17500201,PRCP,0,,,a,\n',
 b'ASN00003088,17500201,PRCP,0,,,a,\n',
 b'ASN00009015,17500201,PRCP,0,,,a,\n',
 b'ASN00009193,17500201,TMIN,187,,,a,\n',
 b'ASN00009193,17500201,PRCP,0,,,a,\n',
 b'ASN00009500,17500201,DATX,2,,,a,\n',
 b'ASN00009500,17500201,MDTX,210,,,a,\n',
 b'ASN00009592,17500201,DATX,4,,,a,\n']

In [9]:
output = [l.decode("utf-8") for l in lines]
output[:10]

['ASN00002061,17500201,PRCP,56,,,a,\n',
 'ASN00003014,17500201,PRCP,0,,,a,\n',
 'ASN00003059,17500201,PRCP,0,,,a,\n',
 'ASN00003088,17500201,PRCP,0,,,a,\n',
 'ASN00009015,17500201,PRCP,0,,,a,\n',
 'ASN00009193,17500201,TMIN,187,,,a,\n',
 'ASN00009193,17500201,PRCP,0,,,a,\n',
 'ASN00009500,17500201,DATX,2,,,a,\n',
 'ASN00009500,17500201,MDTX,210,,,a,\n',
 'ASN00009592,17500201,DATX,4,,,a,\n']

In [10]:
# writelines returns None, so there is nothing to assign.
with open("./data/1750.csv", "w", encoding="utf-8") as out:
    out.writelines(output)

## Problem 2 - Find all stations within 500 miles of Winona, MN.

The file linked below contains information about all stations tracked by NOAA.  

*Main folder:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/

*Station txt file:* https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

*Note.* While it would be easier to use the CSV version of the station file, you should use the TXT version here (for practice).

**Your tasks.** Our goal is to get a list of stations that are within 500 miles of Winona.  Do this as follows.

1. Use `wget` to download the station information into the `./data` folder.
2. Use `with` to read the lines of this file.
3. At this point, the lines are strings in a fixed-width format separated by whitespace.  Use a list comprehension with the string split method to split the raw lines (strings) into a list of entries.
4. There are three entries of interest: the station ID and the lat-long coordinates of the station.  Inspect the file to determine the index for these three entries.
5. We want to transform the lines (currently a list of strings) into a record, which is a `dict` with good names for the entries as keys and the values representing the data in an appropriate type (string for station ID, `float` for the lat-long).  Use a comprehension to create a list of records as described.
6. Use another comprehension to apply a filter to the stations, keeping only those within 500 miles of Winona.
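
Because station names can contain spaces, splitting on whitespace breaks a name into several entries. An alternative sketch slices each line by character position instead. The column positions are an assumption inferred from sample lines of the station file (ID in characters 1-11, latitude in 13-20, longitude in 22-30, name starting at 42), so check them against the file's documentation before relying on them. The function name `parse_station` is hypothetical.

```python
def parse_station(line):
    """Turn one fixed-width station line into a record (dict)."""
    return {
        "Station ID": line[0:11].strip(),
        "Latitude": float(line[12:20]),
        "Longitude": float(line[21:30]),
        "Name": line[41:71].strip(),
    }


# Sample lines in the layout of ghcnd-stations.txt (trailing flag columns omitted).
sample_lines = [
    "ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD\n",
    "AF000040930  35.3170   69.0170 3366.0    NORTH-SALANG\n",
]
records = [parse_station(line) for line in sample_lines]
print(records[0])
```

The ID and the two coordinates sit at indexes 0, 1, and 2 under both approaches. Slicing only matters when a later field, such as the name, is also needed.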

In [28]:
import wget
import gzip

In [11]:
# Your code here (add cells as needed)
url =  "https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt"
output = "./data/stations.txt"
file_name = wget.download(url, output)

In [12]:
with open("./data/stations.txt") as f:
    raw_lines = f.readlines()
raw_lines[:10]

['ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       \n',
 'ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    \n',
 'AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196\n',
 'AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194\n',
 'AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217\n',
 'AEM00041218  24.2620   55.6090  264.9    AL AIN INTL                            41218\n',
 'AF000040930  35.3170   69.0170 3366.0    NORTH-SALANG                   GSN     40930\n',
 'AFM00040938  34.2100   62.2280  977.2    HERAT                                  40938\n',
 'AFM00040948  34.5660   69.2120 1791.3    KABUL INTL                             40948\n',
 'AFM00040990  31.5000   65.8500 1010.0    KANDAHAR AIRPORT                       40990\n']

In [13]:
split_lines = [line.split() for line in raw_lines]
split_lines[:5]

[['ACW00011604',
  '17.1167',
  '-61.7833',
  '10.1',
  'ST',
  'JOHNS',
  'COOLIDGE',
  'FLD'],
 ['ACW00011647', '17.1333', '-61.7833', '19.2', 'ST', 'JOHNS'],
 ['AE000041196',
  '25.3330',
  '55.5170',
  '34.0',
  'SHARJAH',
  'INTER.',
  'AIRP',
  'GSN',
  '41196'],
 ['AEM00041194', '25.2550', '55.3640', '10.4', 'DUBAI', 'INTL', '41194'],
 ['AEM00041217',
  '24.4330',
  '54.6510',
  '26.8',
  'ABU',
  'DHABI',
  'INTL',
  '41217']]

In [32]:
# The indexes for the station ID, latitude, and longitude are 0, 1, and 2.

In [24]:
dict_rows = [{
        "Station ID": parts[0],
        "Latitude": float(parts[1]),
        "Longitude": float(parts[2])
    }
    for parts in split_lines
]
dict_rows[:10]

[{'Station ID': 'ACW00011604', 'Latitude': 17.1167, 'Longitude': -61.7833},
 {'Station ID': 'ACW00011647', 'Latitude': 17.1333, 'Longitude': -61.7833},
 {'Station ID': 'AE000041196', 'Latitude': 25.333, 'Longitude': 55.517},
 {'Station ID': 'AEM00041194', 'Latitude': 25.255, 'Longitude': 55.364},
 {'Station ID': 'AEM00041217', 'Latitude': 24.433, 'Longitude': 54.651},
 {'Station ID': 'AEM00041218', 'Latitude': 24.262, 'Longitude': 55.609},
 {'Station ID': 'AF000040930', 'Latitude': 35.317, 'Longitude': 69.017},
 {'Station ID': 'AFM00040938', 'Latitude': 34.21, 'Longitude': 62.228},
 {'Station ID': 'AFM00040948', 'Latitude': 34.566, 'Longitude': 69.212},
 {'Station ID': 'AFM00040990', 'Latitude': 31.5, 'Longitude': 65.85}]

In [25]:
from geopy.distance import distance

winona = (44.050556, -91.668333)

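# Radius in miles. Note that the task statement asks for 500 miles;
# the 25-mile radius used here produces the 64 stations counted below.
radius_miles = 25.0

filtered_stations = [row for row in dict_rows
                        if distance(winona, (row["Latitude"], row["Longitude"])).miles <= radius_miles]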

In [26]:
len(filtered_stations)

64
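
Computing a geodesic distance for every station in the file can take a while. One optional speed-up, sketched here, is to discard stations outside a latitude-longitude box before calling `distance`. The miles-per-degree figure is a rough, deliberately conservative approximation (an assumption, not a `geopy` value), the helper `in_bounding_box` is hypothetical, and the sketch ignores wrap-around at ±180° longitude. The exact filter with `distance` still needs to run on the stations that pass.

```python
from math import cos, radians

MILES_PER_DEGREE = 68.0  # slightly under the true ~69 so the box errs on the large side


def in_bounding_box(record, center, radius_miles):
    """Cheap test: could this station be within radius_miles of center?"""
    lat0, lon0 = center
    dlat = radius_miles / MILES_PER_DEGREE
    # Longitude degrees shrink toward the poles; use the box edge farthest from the equator.
    edge_lat = min(abs(lat0) + dlat, 89.0)
    dlon = radius_miles / (MILES_PER_DEGREE * cos(radians(edge_lat)))
    return (abs(record["Latitude"] - lat0) <= dlat
            and abs(record["Longitude"] - lon0) <= dlon)


winona = (44.050556, -91.668333)
stations = [
    {"Station ID": "ACW00011604", "Latitude": 17.1167, "Longitude": -61.7833},
    {"Station ID": "NEARBY_EXAMPLE", "Latitude": 44.10, "Longitude": -91.80},  # made-up record
]
candidates = [s for s in stations if in_bounding_box(s, winona, 25.0)]
print([s["Station ID"] for s in candidates])
```

Only the made-up nearby record survives the box test, so the expensive `distance` call would run once instead of twice.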

## Problem 3 - Prototype downloading and uncompressing a station file.

Before we download and uncompress all the stations of interest, let's practice on one station file.


1. Copy the URL for some station and store it as a variable named `url`.
2. Write `lambda` functions that extract each of the following from the station `url`: compressed file name, compressed file path (e.g., `./data/...`), and uncompressed file path (e.g., `./data/...`).
3. Write a `lambda` function that extracts
4. Use `wget` to download this station's data.
5. Use `gzip` to uncompress the data.
6. Write the data to an output file.

Your code should have the following shape:

```{Python}
wget.download(...)
with gzip.open(...) as f:
    with open(..., 'w') as out:
        lines = f.readlines()
        out.writelines(lines)
```

You should be using your helper functions to, in part, fill in the `...`
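
For reference, here is one possible shape for such helpers, sketched on a made-up URL (`example.com` and the file name are placeholders, not a real NOAA path). It uses `str.removesuffix`, which assumes Python 3.9 or later, so that only a trailing `.gz` is dropped.

```python
# Helpers that derive local file names and paths from a download URL.
get_compressed_filename = lambda u: u.split("/")[-1]
get_compressed_path = lambda u: f"./data/{get_compressed_filename(u)}"
# removesuffix only strips a trailing ".gz", unlike replace, which removes every match.
get_uncompressed_path = lambda u: get_compressed_path(u).removesuffix(".gz")

example_url = "https://example.com/files/sample.gz.archive.csv.gz"  # placeholder URL
print(get_compressed_filename(example_url))
print(get_compressed_path(example_url))
print(get_uncompressed_path(example_url))
```

Having every helper take the URL itself means the loop in a later step only needs to keep track of one value per station.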

In [37]:
# Your code here
url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_year/1750.csv.gz"

get_compressed_filename = lambda u: u.split("/")[-1]
get_compressed_path = lambda fname: f"./data/{fname}"
get_uncompressed_path = lambda fname: f"./data/{fname.replace('.gz','')}"

In [38]:
compressed_filename = get_compressed_filename(url)
compressed_filename

'1750.csv.gz'

In [39]:
compressed_filepath = get_compressed_path(compressed_filename)
compressed_filepath

'./data/1750.csv.gz'

In [40]:
uncompressed_filepath = get_uncompressed_path(compressed_filename)
uncompressed_filepath

'./data/1750.csv'

In [41]:
test_file = wget.download(url, out=compressed_filepath)
test_file

'./data/1750.csv (1).gz'

In [42]:
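import os

# Download only if the file is missing; wget renames repeat downloads (e.g. "1750.csv (1).gz").
if not os.path.exists(compressed_filepath):
    wget.download(url, out=compressed_filepath)
with gzip.open(compressed_filepath, "rt", encoding="utf-8") as f:
    with open(uncompressed_filepath, "w", encoding="utf-8") as out:
        # Write every line; the by_year file has no header row to skip.
        out.writelines(f)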

## Problem 4 - Build the station URLs and download the files.

**Tasks.** Now you need to build URLs for all stations of interest and download their data.

1. Use a comprehension to extract the IDs of the stations of interest into a list.
2. Investigate the structure of the files stored in the `by_station` folder (see main folder link above).
3. Use a comprehension and an `f` string to build a list of URLs for all stations of interest.
4. Use `wget` to download the data for the stations of interest into the data folder.
5. Use `gzip` to uncompress the files.
6. Convert the `bytes` to `str` using the `utf-8` encoding.
7. Use the append mode `"a"` of `open` with `writelines` to append the data in each file to your output file.

While we usually avoid using a `for` loop, we make an exception for lengthy IO code.  To accomplish steps 4-7, use a `for` loop with the following shape.

```{Python}
for url in station_urls:
    wget.download(...)
    with gzip.open(...) as f:
        with open(..., 'a') as out:
            lines = f.readlines()
            ... # Convert lines to strings here
            out.writelines(lines)
    print(f"Downloaded and extracted the data for {url}")
```

Note that the code inside the loop should resemble the code from the previous step.
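
Reading each file with `readlines` holds the whole file in memory. As an optional variation (a sketch, not the required solution), opening the compressed file in text mode `"rt"` and passing the file object itself to `writelines` streams the lines one at a time and folds the decode step into the read. The example builds two tiny stand-in `.gz` files in a temporary folder instead of downloading; their names and contents are invented for illustration.

```python
import gzip
import os
import tempfile

folder = tempfile.mkdtemp()
output_file = os.path.join(folder, "combined.csv")

# Create two small stand-in station files (invented data).
gz_paths = []
for name, row in [("A.csv.gz", "A,20200101,PRCP,0\n"), ("B.csv.gz", "B,20200101,PRCP,5\n")]:
    path = os.path.join(folder, name)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(row)
    gz_paths.append(path)

# Start the output file fresh, then append each uncompressed file to it.
open(output_file, "w", encoding="utf-8").close()
for path in gz_paths:
    # "rt" decodes bytes to str while reading, so no separate decode step is needed.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        with open(output_file, "a", encoding="utf-8") as out:
            out.writelines(f)  # iterates over f line by line

with open(output_file, encoding="utf-8") as f:
    print(f.read())
```

Creating the output file once before the loop matters: with append mode alone, rerunning the cell would add a second copy of every row.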

In [27]:
# Your code here.
station_ids = [row["Station ID"] for row in filtered_stations]
station_ids[:5]

['US1MNHS0001', 'US1MNHS0006', 'US1MNHS0007', 'US1MNHS0008', 'US1MNHS0009']

In [29]:
base_url = "https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station"
station_urls = [f"{base_url}/{sid}.csv.gz" for sid in station_ids]
station_urls[:5]

['https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0001.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0006.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0007.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0008.csv.gz',
 'https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0009.csv.gz']

In [30]:
output_file = "./data/all_stations_data.csv" 
# adding a header
header = "ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME\n"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(header)

In [31]:
for url in station_urls:
    # Build the local path from the current URL (not a stale `url` from an earlier cell).
    compressed_filename = url.split("/")[-1]
    compressed_filepath = f"./data/{compressed_filename}"
    # wget.download returns the path it actually wrote, which may differ if the file already exists.
    downloaded_path = wget.download(url, out=compressed_filepath)
    with gzip.open(downloaded_path, "rt", encoding="utf-8") as f:
        with open(output_file, "a", encoding="utf-8") as out:
            out.writelines(f)
    print(f"Downloaded and extracted the data for {url}")


Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0001.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0006.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0007.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0008.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0009.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0012.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0013.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_station/US1MNHS0022.csv.gz
Downloaded and extracted the data for https://www.ncei.noaa.gov/pub/data/ghcn/daily/by_s