# 📅 Day 2 — Data Cleaning & Preparation

*RTU Data Analysis & Visualization CPD course*

**📚 Instruction (3h)**  
- 🧹 Handling missing values  
- 🗑 Removing duplicates  
- 🔄 Data type conversion  
- 📅 Parsing dates  
- 🏗 Feature engineering basics  
- 🔗 Combining datasets  
- 🏷 Intro to categorical encoding  

**🛠 Practical (1h)**  
- 🧽 Clean a messy dataset  
- 🔀 Merge with a secondary dataset  

**🔄 Reflection (1h)**  
- 🧐 Review: common pitfalls in cleaning  
- 💬 Discuss real-world cleaning challenges  
- 📝 Recap exercise: identify cleaning steps for a small example dataset

## 🎯 Goals for the Day
- Strengthen Python basics (functions, loops, if/else, file handling)
- Learn to process raw messy text files into usable form
- Apply pandas methods to clean incomplete/messy data
- Merge multiple datasets into a single unified dataframe

## 💡 Motivation / Explanation

### Introduction to Data Cleaning and Preparation

- **Why cleaning is critical before analysis**
  - Raw data is almost never ready for direct analysis
  - Errors, inconsistencies, and missing information can distort results
  - Proper cleaning ensures reliability, reproducibility, and trust in analysis outcomes

- **Real-world examples of messy data**
  - 🧹 Handling missing values - Weather records with missing timestamps or corrupt values
  - 🗳️ Survey responses with inconsistent categories (e.g., "Male", "male", "M")
  - 🗑 Removing duplicates - Financial transactions with duplicate entries
  - 🗑 Log files with noise lines, system messages, or broken encodings
  - 📅 Parsing dates - Event logs with inconsistent timestamp formats
  - 🔄 Data type conversion - User age recorded as text instead of numbers

> Think of data cleaning as *“washing vegetables before cooking”* — not exciting, but essential for a good meal.


## Weather Dataset: `latvia_meteo_1925_messy.zip`

Let us imagine we are helping Toms Bricis with weather data analysis for the year 1925. We have come across a bundle of messy text files that require cleaning and preparation.

- **What it is:** a bundle of **five “messy” text files** (≈50 rows each) simulating daily measurements from Latvian stations in **1925**:
  - **Rīga-University** — *Period 1 (Jan–Mar)*
  - **Rīga-University** — *Period 2 (Sep–Nov)*
  - **Liepāja** — *Apr–Jun*
  - **Mērsrags** — *Feb–May*
  - **Alūksne** — *Oct–Dec*

- **Columns present (but order varies by file):**  
  `date`, `t_max_c`, `t_min_c`, `precip_24h_mm`, `precip_type`, `present_weather_code`, `notes`

- **Deliberate “messiness” to practice cleaning:**
  - **Different separators:** `;`, `,`, `|`, and **TAB** (documented in each file’s `# fields=` header).
  - **Mixed column order** across files (use the header to map columns).
  - **Date formats vary** (`YYYY-MM-DD`, `DD.MM.YYYY`, `YYYY/MM/DD`, `DD-MM-YYYY`, `MM-DD-YYYY`, `YYYY.MM.DD`) and sometimes **include a time** (e.g., `07:00`).
  - **Numeric quirks:** decimal **commas** (e.g., `0,6`), **units** in strings (e.g., `0.8 mm`), and the Latvian word **“nulle”** for zero.
  - **Missing values** sprinkled in as `""`, `NA`, `—`, `-999`.
  - **Codes as strings** with possible leading zeros (e.g., `present_weather_code = "05"`).
  - **Free-text `notes`** in Latvian from the station master (may be blank/missing).

- **Intended skills to practice (Day 2):**
  - Detect & use **separators/column order** from headers.
  - **Parse heterogeneous dates** (with optional times).
  - Normalize **numerics/units** (decimal commas, `mm`, worded zeros).
  - Unify **missing values** and enforce **types** (e.g., cast weather codes to integers).
  - Keep useful **categorical text** (`precip_type`, `notes`) intact.

As part of our workflow we will want to verify whether the above descriptions of messiness hold true for our specific dataset files. This will help us tailor our cleaning approach effectively.


## Part 1: Python Fundamentals for Data Cleaning

For our first part we will use basic Python programming skills to explore and understand the dataset structure before diving into the cleaning process.

### 🔑 Key Idea - Loops Go Brrr

One of the key advantages of programming is that we can automate repetitive tasks with loops. This is especially useful when working with datasets: we can apply the same operations to many rows or files without writing redundant code.

Loops also let us work out an approach for a single file and then apply it to the others with little extra effort.


### Getting ready for work

Typically in a finished notebook (and also normal scripts / programs), we want to start with a clear setup phase. This includes:

1. **Importing Libraries:** Load all necessary libraries at the beginning.
2. **Setting Up Paths:** Define file paths and other constants.
3. **Configuring Options:** Set any options or preferences (e.g., display settings).

By organizing our code this way, we make it easier to understand and modify later on.


In [1]:
# usually we start with general Python imports
from pathlib import Path # for file and file path related tasks
import sys, platform, os, io, shutil, zipfile, re # system related tasks
from datetime import datetime
# first datetime
print(f"Today : {datetime.now().isoformat(timespec='seconds')}")
# now Python version
print(f"Python : {sys.version}")

# then we import external libraries
# external - not part of Python installation
# on Google Colab those are already installed
try:
    import pandas as pd
    print('pandas:', pd.__version__)
except ImportError:
    print(f"pandas not installed. Install with `pip install pandas`.")
    # for excel support extra instructions
    print(f"Install `openpyxl` for Excel support with `pip install openpyxl`.")
# requests is a widely used network library that makes internet "requests" easier
try:
    import requests
except ImportError:
    requests = None
    print('requests not installed. Install with `pip install requests`.')


# we can also print out what type of environment we are running in, this could show OS information
print('Runtime:', platform.platform())
# We could also show total and free RAM, but that needs either a non-standard library
# (for example psutil does this out of the box) or some extra helper functions,
# so we skip it for now; you can ask an LLM to write those functions for you

# Let us show our current drive space
print(f"Total Current Drive Space: {shutil.disk_usage('/').total / (1024**3):.2f} GB")
print(f"Free Current Drive Space: {shutil.disk_usage('/').free / (1024**3):.2f} GB")

# Current Working Directory
print(f"Current Working Directory: {Path.cwd()}")
# note that in some cases you might not want to provide all this information to the public, if you have a super secret computer...


Today : 2025-08-20T15:22:55
Python : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
pandas: 2.2.2
Runtime: Linux-6.1.123+-x86_64-with-glibc2.35
Total Current Drive Space: 107.72 GB
Free Current Drive Space: 68.94 GB
Current Working Directory: /content


### 🧑‍💻 Functions

### What is the idea behind functions in programming?

In programming, a function is a block of reusable code that performs a specific task. Think of it like a miniature program within your main program. Functions are designed to:

- **Break down complex problems:** Large problems can be divided into smaller, manageable parts, each handled by a function. This makes code easier to write, understand, and debug.
- **Avoid repetition (DRY principle - Don't Repeat Yourself):** If you need to perform the same set of actions multiple times, you can define a function once and call it whenever needed, rather than writing the same code repeatedly.
- **Improve code organization and readability:** Functions group related code together, making the overall structure of your program clearer and easier to follow.
- **Enhance code reusability:** Once a function is defined, it can be used in different parts of the same program or even in other programs.
- **Simplify debugging:** If there's an issue, you can isolate the problem to a specific function, making it easier to find and fix the error.

In essence, functions are tools for modularity and abstraction in programming, allowing you to create more organized, efficient, and maintainable code.

### What is a function in Python?

In Python, a function is defined using the `def` keyword, followed by the function name, parentheses `()`, and a colon `:`. The code block within the function is indented. Functions can optionally take inputs called *arguments* (placed inside the parentheses) and can return a value using the `return` keyword.

Here's a basic structure of a Python function:

In [3]:
def greet(name: str = 'student') -> str:
    """
    Returns a friendly greeting for the given name.
    If no name is provided, defaults to 'student'.
    """
    # Use an f-string to insert the name into the greeting
    return f"Hello, {name}!" # we can insert pretty much any type of data in f-strings

# Call the function and print the result
greeting = greet() # assign results of greet() function to greeting variable
print(greeting)

Hello, student!


In [5]:
# Now that I have this function I can make other greetings
my_greeting = greet("Valdis")
print(my_greeting)
numeric_greeting = greet(808) # type hints are not enforced at runtime, so passing an int works here
print(numeric_greeting)

Hello, Valdis!
Hello, 808!


### Function to download and unzip file from url

In [6]:
def download_and_unzip(url: str, target_folder: str | Path = 'sample_data') -> Path:
    """
    Downloads a ZIP file from the given URL and extracts it to the target folder.
    Returns the path to the folder where files were extracted.
    """
    target = Path(target_folder)  # Make sure target is a Path object
    target.mkdir(parents=True, exist_ok=True)  # Create the folder if it doesn't exist
    filename = url.split('/')[-1]  # The file name is the last part of the URL (index -1 is the last item of a list)
    zip_path = target / filename  # Full path where the ZIP file will be saved
    # check if library exists
    if requests is None:
        raise RuntimeError('requests required.')
    # Download the file in chunks (good for large files)
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()  # Raise an error if download failed
        with open(zip_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    # Unzip the downloaded file
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(target) # extracts ALL files from the archive into the target folder
    # Optionally, remove the ZIP file after extraction
    # zip_path.unlink(missing_ok=True)  # Remove the ZIP file
    return target  # Return the folder where files were extracted


**Practice dataset for Part 1:** `latvia_meteo_1925_messy.zip` (5 text files)

- URL: https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_messy.zip

In [7]:
# let's download the messy files zip
url = "https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_messy.zip"
print(f"Will download and unzip from following url: {url}")
# let's download the file and extract it under day_2_data
download_and_unzip(url, Path("day_2_data"))

Will download and unzip from following url: https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_messy.zip


PosixPath('day_2_data')

In [8]:
# Google Colab specific: offer every extracted file as a download to your computer
from google.colab import files # this import only works on Google Colab
# if you run locally, the files are already on your disk
for p in sorted(Path("day_2_data").glob("*.txt")): # glob("*.txt") matches the .txt files inside day_2_data
    files.download(p)


### 📂 File Handling

First, let me show you how to read a whole text file into one big string.

In [14]:
# let's make a function that takes a file Path or string and optional encoding with default utf-8 and returns text string
def get_file_contents(path: Path | str, encoding: str = 'utf-8') -> str:
    """
    Reads the contents of a text file and returns it as a string.
    """
    # so path could be string or Path
    # file could have any extension but it should contain text of some sort
    with open(path, 'r', encoding=encoding) as f:
        return f.read()

In [15]:
# let's read aluksne into memory
# aluksne = get_file_contents(Path("day_2_data/aluksne_1925.txt"))
aluksne = get_file_contents("day_2_data/aluksne_1925.txt")
print(aluksne)

# station_name=Alūksne
# period=(Oct–Dec)
# separator_hint=|
# columns_in_this_file= notes | present_weather_code | t_max_c | precip_24h_mm | date | precip_type | t_min_c
# note: some values intentionally messy (units, words, missing, time in date)
# fields=notes|present_weather_code|t_max_c|precip_24h_mm|date|precip_type|t_min_c
—|53|5.9|3.0 mm|10-19-1925|R|5.5
Rīts auksts|-999|9.1|2.3 mm|18.10.1925|mixed|4.2
Daļēji mākoņains|65|5.4|0.0|11-10-1925||-0.5
Daļēji mākoņains|50|15.6|0.5|10-07-1925 14:00|S|10.4
Neliels vējš|63|-1.1|0.0|12-28-1925|—|—
Vēlāk sāka līt|-999|5.3||12-07-1925 19:00|none|
Pūtis brāzmas|82|4.2|0.6 mm|11-07-1925|S|0.1
—|61|-7.2|1.4|12-03-1925|mixed|-6.5
Laikam migla|51|3.3|1.0 mm|11-27-1925|mixed|-1.3
Mērīšanas kļūda?|70|-3.5999999999999996|2.1|12-06-1925|M|-2.9
Skursteņi kūp|71|6.2|0.0|10-21-1925||6.9
NA|61|1.3|2.1|28.11.1925|NA|2.0
Ap pusdienlaiku saule|80|8.4|1.9 mm|02-11-1925|snow|-1.5
|53|8.4||10-01-1925 07:00||3.3
Mērījums apstiprināts|53|10.8|0.7 mm|10-08-1925

### Iterating over file line at a time

The example above read the whole file into memory at once.
That leaves us with one big string, on which we could run operations such as `replace`.

Much more often we will want to process a file one line (row) at a time.

In [16]:
def iter_lines(path: Path | str):
    """
    Yields each line from a text file, removing the newline character at the end.
    Useful for reading files line by line.
    """
    # Open the file for reading (UTF-8 encoding, replace errors)
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            yield line.rstrip('\n')  # Remove the newline at the end of each line
            # why yield? # because it allows processing each line one at a time, which is more memory efficient
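
To see why `yield` helps, the sketch below peeks at the first lines of a file without collecting the rest. It repeats the `iter_lines` idea so it can run on its own; `peek_lines` and the throwaway sample file are illustrative assumptions.

```python
from itertools import islice
from pathlib import Path
import tempfile

def iter_lines(path):
    """Same idea as the iter_lines cell: yield one line at a time."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line.rstrip("\n")

def peek_lines(path, n=3):
    """Hypothetical helper: return only the first n lines; later lines are never yielded."""
    return list(islice(iter_lines(path), n))

# Throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text(
        "# station_name=Demo\n1925-01-01;1.5\n1925-01-02;2.0\n1925-01-03;0.5\n",
        encoding="utf-8",
    )
    first_two = peek_lines(sample, 2)

print(first_two)
```

Because the generator produces lines on demand, asking for two lines costs about the same for a tiny file as for a very large one.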

### 🔄 For Loops

In [17]:
def count_lines(path: Path) -> tuple[int, int]:
    """
    Counts the total number of lines and the number of non-empty lines in a file.
    Returns a tuple: (total_lines, nonempty_lines)
    """
    total, nonempty = 0, 0  # Initialize counters
    for line in iter_lines(path):  # Go through each line in the file
        total += 1
        if line.strip():  # Check if the line is not just whitespace
            nonempty += 1
    return total, nonempty # this is how we return two values at once in Python

In [18]:
# let's count lines in aluksne_1925
count_lines(Path("day_2_data/aluksne_1925.txt"))

(42, 42)

### 🌳 If / Else Branching

In [19]:
# note the is_ prefix in the name: a common convention for functions that return True or False
def is_good_line(line: str) -> bool:
    """
    Checks if a line of text is 'good' (not empty, not a comment, and long enough).
    Returns True if the line is good, False otherwise.
    """
    s = line.strip()  # Remove whitespace from both ends
    # so now s is a string of text - possibly empty "" or something but without whitespace (\n, \t, at left or right side)
    if not s: return False  # Skip empty lines
    if s.startswith('#'): return False  # Skip comment lines
    if len(s) < 5: return False  # Skip very short lines, this is an assumption
    # you could add any logic you want here
    # idea is that if we checked everything bad possible, then we return True meaning this is a good line
    return True

### 🧹 Building a Cleaning Function

In [20]:
def clean_file(path: Path) -> tuple[int, int]:
    """
    Reads a file and writes 'good' lines to one file and 'bad' lines to another.
    Returns a tuple: (number_of_good_lines, number_of_bad_lines)
    """
    out_good = path.with_suffix('.good.txt')  # Output file for good lines
    out_bad = path.with_suffix('.bad.txt')    # Output file for bad lines
    good, bad = 0, 0  # Counters for good and bad lines
    # Open the input file and both output files
    # the nice thing about this recipe is that it reads the input one line at a time,
    # so it also works on files that do not fit in memory, even multi-TB ones
    with path.open('r', encoding='utf-8', errors='replace') as fin, \
         out_good.open('w', encoding='utf-8') as fg, \
         out_bad.open('w', encoding='utf-8') as fb:
        # we loop through the input text file one line at a time
        for line in fin:
            if is_good_line(line):
                fg.write(line)  # Write good line
                good += 1
            else:
                fb.write(line)  # Write bad line
                bad += 1
    # a function without a return statement returns None by default; here we return both counts
    return good, bad

In [21]:
# let's try running it on aluksne_1925.txt
clean_file(Path("day_2_data/aluksne_1925.txt"))

(36, 6)

### 📁 Extending to Folders

In [24]:
def clean_files(folder: Path):
    """
    Cleans all .txt files in the given folder using clean_file().
    Returns a dictionary with file paths as keys and (good, bad) counts as values.
    """
    results = {}  # Dictionary to store results for each file
    for file in folder.glob('*.txt'):
        if "good" in file.name or "bad" in file.name: continue  # Skip already processed output files
        # TODO: good practice would be to pass "good" and "bad" as parameters
        results[file] = clean_file(file)  # Clean each file and store the result
    return results # we return a dictionary of results

In [23]:
# let's run it day_2_data folder
results = clean_files(Path("day_2_data"))

In [25]:
results

{PosixPath('day_2_data/aluksne_1925.txt'): (36, 6),
 PosixPath('day_2_data/liepaja_1925.txt'): (36, 6),
 PosixPath('day_2_data/riga_university_1925_p2.txt'): (36, 6),
 PosixPath('day_2_data/mersrags_1925.txt'): (48, 6),
 PosixPath('day_2_data/riga_university_1925_p1.txt'): (36, 6)}

## Determining the separator character

Each file helpfully includes a separator hint in its header; without it, the separator would be much harder to determine. For example:

- `# separator_hint=|`
- `# separator_hint=TAB`
- `# separator_hint=,`

Let's write a function that extracts the hint from a file.


In [26]:
def get_sep(path: Path) -> str | None:
    """
    Returns the separator named in the first `# separator_hint=` line of a file,
    for example `# separator_hint=|`, `# separator_hint=TAB` or `# separator_hint=,`.
    Returns None if the file has no hint line.
    """
    needle = "separator_hint="
    sep = None # we start with the assumption that we do not know the separator
    # let's open the file and go through it line by line
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            if needle in line: # we check for the presence of the needle
                # we could have used a regular expression but there is no need here
                # we split by the needle and take whatever comes after it
                sep = line.split(needle)[-1].strip()
                break # stop at the first hint line
    # the hint spells the tab character as TAB, so we convert it to "\t"
    if sep == "TAB":
        return "\t"
    return sep

In [27]:
# let's see if we can find separator for aluksne_1925
get_sep(Path("day_2_data/aluksne_1925.txt"))

'|'

In [28]:
# let's find it for all txt files that are not .good/.bad outputs
for p in sorted(Path("day_2_data").glob("*.txt")):
    if "good" in p.name or "bad" in p.name: continue
    # the hint lines start with '#', so clean_file moved them into the .bad.txt files;
    # we could read the separator from those as well
    print(p, get_sep(p))

day_2_data/aluksne_1925.txt |
day_2_data/liepaja_1925.txt ,
day_2_data/mersrags_1925.txt 	
day_2_data/riga_university_1925_p1.txt ;
day_2_data/riga_university_1925_p2.txt ;


### 📥 Loading Cleaned Files into DataFrames

In [31]:
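# we might want to supply our own separator

def load_cleaned_file(path: Path, sep: str = ",") -> pd.DataFrame:
    """
    Splits every non-empty line on sep and returns the rows as a DataFrame of strings.
    Columns are named col1, col2, ...; shorter rows are padded with missing values.
    """
    with path.open('r', encoding='utf-8', errors='replace') as f:
        # strip only the line ending: a plain strip() would also eat leading/trailing TAB separators
        rows = [line.rstrip('\r\n').split(sep) for line in f if line.strip()]
    maxlen = max(len(r) for r in rows) if rows else 0
    cols = [f'col{i+1}' for i in range(maxlen)] if maxlen else []
    return pd.DataFrame(rows, columns=cols)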

In [33]:
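# let's test if we get a DataFrame from liepaja_1925.good.txt
# note: we use the default sep="," here; Liepāja also writes decimal commas (e.g. 13,2),
# so some values are split in two and we end up with more than 7 columns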
liepaja_df = load_cleaned_file(Path("day_2_data/liepaja_1925.good.txt"))
liepaja_df

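| | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 2 | 1925/05/26 | none | 75 | 0 | 0 mm | -999 | Mērīšanas kļūda? | |
| 1 | 24 | 4 | 07.06.1925 | rain | 75 | 0 | 6 | 16 | 8 | Neliels vējš |
| 2 | 12 | 600000000000001 | 1925/05/20 | S | 50 | 0 | 4 mm | 13 | 3 | Pūtis brāzmas |
| 3 | 15 | 8 | 1925/06/16 | | 73 | 0 | 2 | 10 | 8 | Skursteņi kūp |
| 4 | 21 | 9 | 1925/06/25 07:00 | — | 81 | 0 | 0 mm | 15 | 4 | |
| 5 | 17 | 1 | 1925/05/25 | mixed | 81 | 2 | 4 mm | 9 | 1 | Daļēji mākoņains |
| 6 | 7 | 4 | 1925/04/16 | snow | 90 | 1 | 5 mm | | | |
| 7 | 5 | 0 | 1925/04/17 13:00 | R | 90 | | 5 | 7 | Neliels vējš | |
| 8 | | 1925/05/23 | | 80 | | -999 | Mērījums apstiprināts | | | |
| 9 | 17 | 1 | 1925/05/18 | none | 82 | -999 | 10 | 5 | Ap pusdienlaiku saule | |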

In [34]:
# first let's get all good.txt files as list
good_files = sorted(Path("day_2_data").glob("*.good.txt"))
good_files


[PosixPath('day_2_data/aluksne_1925.good.txt'),
 PosixPath('day_2_data/liepaja_1925.good.txt'),
 PosixPath('day_2_data/mersrags_1925.good.txt'),
 PosixPath('day_2_data/riga_university_1925_p1.good.txt'),
 PosixPath('day_2_data/riga_university_1925_p2.good.txt')]

In [38]:
# let's convert all Paths to strings
good_files = [str(p) for p in good_files]
good_files


['day_2_data/aluksne_1925.good.txt',
 'day_2_data/liepaja_1925.good.txt',
 'day_2_data/mersrags_1925.good.txt',
 'day_2_data/riga_university_1925_p1.good.txt',
 'day_2_data/riga_university_1925_p2.good.txt']

In [39]:
# for each good file, find the matching bad file: suffix .bad.txt instead of .good.txt
bad_files = [p.replace('.good.txt', '.bad.txt') for p in good_files]
bad_files

['day_2_data/aluksne_1925.bad.txt',
 'day_2_data/liepaja_1925.bad.txt',
 'day_2_data/mersrags_1925.bad.txt',
 'day_2_data/riga_university_1925_p1.bad.txt',
 'day_2_data/riga_university_1925_p2.bad.txt']

In [40]:
# now let's convert all bad_files back to path
bad_files = [Path(p) for p in bad_files]
bad_files

[PosixPath('day_2_data/aluksne_1925.bad.txt'),
 PosixPath('day_2_data/liepaja_1925.bad.txt'),
 PosixPath('day_2_data/mersrags_1925.bad.txt'),
 PosixPath('day_2_data/riga_university_1925_p1.bad.txt'),
 PosixPath('day_2_data/riga_university_1925_p2.bad.txt')]

In [41]:
# let's check if these bad files exist
for p in bad_files:
    print(p, p.exists())

day_2_data/aluksne_1925.bad.txt True
day_2_data/liepaja_1925.bad.txt True
day_2_data/mersrags_1925.bad.txt True
day_2_data/riga_university_1925_p1.bad.txt True
day_2_data/riga_university_1925_p2.bad.txt True


In [42]:
# now let's convert good_files back to Paths as well
good_files = [Path(p) for p in good_files]
good_files

[PosixPath('day_2_data/aluksne_1925.good.txt'),
 PosixPath('day_2_data/liepaja_1925.good.txt'),
 PosixPath('day_2_data/mersrags_1925.good.txt'),
 PosixPath('day_2_data/riga_university_1925_p1.good.txt'),
 PosixPath('day_2_data/riga_university_1925_p2.good.txt')]

In [43]:
# so now let us iterate (loop) through both lists of files at the same time
# we will extract separator from bad text file and extract DataFrame from good text file
# we will store Dataframes in a dictionary
dataframe_dict = {}
for good, bad in zip(good_files, bad_files): # so we take one item at a time from each list of items
    sep = get_sep(bad)
    print(f"Voila found separator {sep} in {bad} file")
    dataframe_dict[good.stem] = load_cleaned_file(good, sep=sep)

# print size of dictionary
print(f"Dictionary size: {len(dataframe_dict)}")

Voila found separator | in day_2_data/aluksne_1925.bad.txt file
Voila found separator , in day_2_data/liepaja_1925.bad.txt file
Voila found separator 	 in day_2_data/mersrags_1925.bad.txt file
Voila found separator ; in day_2_data/riga_university_1925_p1.bad.txt file
Voila found separator ; in day_2_data/riga_university_1925_p2.bad.txt file
Dictionary size: 5


In [44]:
# let us see what keys we have in our dictionary
dataframe_dict.keys()

dict_keys(['aluksne_1925.good', 'liepaja_1925.good', 'mersrags_1925.good', 'riga_university_1925_p1.good', 'riga_university_1925_p2.good'])

In [45]:
# let us print the shape of each DataFrame
for key, value in dataframe_dict.items():
    print(f"Dataframe {key} has shape {value.shape}")

Dataframe aluksne_1925.good has shape (36, 7)
Dataframe liepaja_1925.good has shape (36, 10)
Dataframe mersrags_1925.good has shape (48, 7)
Dataframe riga_university_1925_p1.good has shape (36, 7)
Dataframe riga_university_1925_p2.good has shape (36, 7)
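
The DataFrames above still carry generic names (`col1`, `col2`, ...). Since each file documents its column order in a `# fields=` header, those names could be recovered from it. The sketch below assumes a hypothetical helper `parse_fields_header` and uses two rows copied from the Alūksne file; it only works when every row has as many fields as the header.

```python
import pandas as pd

def parse_fields_header(header_lines: list[str], sep: str) -> list[str]:
    """Hypothetical helper: column names from the first '# fields=' line, or [] if absent."""
    needle = "fields="
    for line in header_lines:
        if line.startswith("#") and needle in line:
            return [name.strip() for name in line.split(needle, 1)[1].split(sep)]
    return []

# Header lines as they would be found in the .bad.txt file
header = [
    "# station_name=Alūksne",
    "# separator_hint=|",
    "# fields=notes|present_weather_code|t_max_c|precip_24h_mm|date|precip_type|t_min_c",
]
rows = [
    "—|53|5.9|3.0 mm|10-19-1925|R|5.5".split("|"),
    "Rīts auksts|-999|9.1|2.3 mm|18.10.1925|mixed|4.2".split("|"),
]

names = parse_fields_header(header, sep="|")
df = pd.DataFrame(rows, columns=names)

# Reorder into one shared column order so all files line up
wanted = ["date", "t_max_c", "t_min_c", "precip_24h_mm",
          "precip_type", "present_weather_code", "notes"]
df = df[wanted]
print(df)
```

Selecting columns by name rather than by position is what makes the mixed column order across files harmless.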


In [46]:
# let us save all dataframes as excel files
for key, value in dataframe_dict.items():
    value.to_excel(f"{key}.xlsx")

In [47]:
# on Colab let us download all xlsx files in current directory
from google.colab import files
for p in sorted(Path(".").glob("*.xlsx")):
    files.download(p)


## Part 2: Guided Exercise — Latvia Weather Data (Extra Messy)

**Duration:** ~30 minutes  
**Dataset:** `latvia_meteo_1925_extra_messy.zip`  
**URL:** https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_extra_messy.zip

### 🎯 Objective
Convert multiple extra-messy weather text files into **one cleaned file per source**, then load each into a **separate DataFrame**.

### ✅ Success Criteria
- Each original text file has a corresponding `.good.txt` output.
- Each `.good.txt` loads into a DataFrame without errors.
- Basic column consistency achieved (same number of columns and sensible types where possible).

### 🔍 What to Watch For
- Junk header/footer lines (e.g., comments, separators)
- Inconsistent separators (`,`, `;`, tabs, or spaces)
- Missing fields and short/empty lines
- Non-UTF8 characters — use `errors='replace'` if needed

### 🧭 Suggested Workflow
1) **Download & unzip** to `data/` using `download_and_unzip`  
2) **List files** and do a quick **line count** with `count_lines`  
3) **Clean** with `clean_files(data_dir)`  
4) **Load** each cleaned file with `load_cleaned_file`  
5) **Sanity-check**: `.head()`, `.info()`, and simple value counts on key columns

### 🧩 Hints
- If a file still fails to parse, adjust `is_good_line` (e.g., skip lines that start with specific tokens).
- If different files use different separators, pass the right one to `load_cleaned_file` via `sep`, or handle it later at the **pandas** stage (Part 3) by re-parsing columns.
- Keep outputs organized: write cleaned files into a `data/cleaned/` subfolder if you choose to extend `clean_files`.
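
As one possible take on the last hint, the sketch below writes the good and bad lines into a separate output folder. The name `clean_files_to` and the simplified `is_good_line` stand-in are assumptions made so the block runs on its own; the notebook's own functions may differ.

```python
from pathlib import Path
import tempfile

def is_good_line(line: str) -> bool:
    """Simplified stand-in for the notebook's is_good_line."""
    s = line.strip()
    return bool(s) and not s.startswith("#") and len(s) >= 5

def clean_files_to(folder: Path, out_dir: Path) -> dict[str, tuple[int, int]]:
    """Hypothetical variant of clean_files that writes its outputs into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results: dict[str, tuple[int, int]] = {}
    # glob('*.txt') is not recursive, so files already in a subfolder are not picked up again
    for src in sorted(folder.glob("*.txt")):
        good_path = out_dir / f"{src.stem}.good.txt"
        bad_path = out_dir / f"{src.stem}.bad.txt"
        good = bad = 0
        with src.open("r", encoding="utf-8", errors="replace") as fin, \
             good_path.open("w", encoding="utf-8") as fg, \
             bad_path.open("w", encoding="utf-8") as fb:
            for line in fin:
                if is_good_line(line):
                    fg.write(line)
                    good += 1
                else:
                    fb.write(line)
                    bad += 1
        results[src.name] = (good, bad)
    return results

# Throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "demo_1925.txt").write_text(
        "# fields=date;t_max_c\n1925-01-01;1.5\n\n1925-01-02;2.0\n", encoding="utf-8"
    )
    summary = clean_files_to(data_dir, data_dir / "cleaned")
    written = sorted(p.name for p in (data_dir / "cleaned").iterdir())

print(summary)
print(written)
```

Keeping outputs in their own folder also removes the need to skip files by name, as long as the source folder holds only raw inputs.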

In [None]:
# --- SKELETON (students fill in) ---
EXTRA_URL = 'https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_extra_messy.zip'
DATA_DIR = Path('data')

# 1) Download & unzip
# download_and_unzip(EXTRA_URL, DATA_DIR)

# 2) Inspect: list files & counts
# for p in sorted(DATA_DIR.glob('*.txt')):
#     print(p.name, '->', count_lines(p))

# 3) Clean all files
# results = clean_files(DATA_DIR)
# results

# 4) Load cleaned files
# dfs_extra = {}
# for p in sorted(DATA_DIR.glob('*.good.txt')):
#     dfs_extra[p.stem] = load_cleaned_file(p)
# {k: v.head() for k, v in dfs_extra.items()}

### 🧪 Checkpoints
- At least **N≥3** cleaned files successfully load into DataFrames.
- No parsing exceptions on `.head()` or `.info()`.
- You can explain (in comments) which rules your `is_good_line` used.

### 🛠 Extension (Optional)
- Write a variant `clean_files(folder, out_dir=Path('data/cleaned'))` that writes outputs into a subfolder.
- Add a **regex-based** `is_good_line_regex` that only keeps lines starting with `YYYY-MM-DD`.
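
A minimal sketch of the regex-based extension could look like this. It keeps only lines whose first field looks like `YYYY-MM-DD`; note that this is a strict rule, and in files where the date is not the first column or uses another format it would reject valid rows.

```python
import re

# Four digits, dash, two digits, dash, two digits at the very start of the line
ISO_DATE_START = re.compile(r"^\d{4}-\d{2}-\d{2}")

def is_good_line_regex(line: str) -> bool:
    """Return True only for lines that start with a YYYY-MM-DD date."""
    return bool(ISO_DATE_START.match(line.strip()))

for sample in ["1925-10-19;5.9;5.5", "# fields=date;t_max_c", "19.10.1925;5.9;5.5", ""]:
    print(repr(sample), "->", is_good_line_regex(sample))
```

The pattern checks the shape of the text only; a value such as `1925-99-99` would still pass, so real date validation belongs to the later parsing step.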

## Part 3: Pandas-Specific Data Cleaning

### Overview
In this section, you will standardize each DataFrame from Part 2 so they share a **common schema** and are ready to merge.

### Target Schema (example)
- `date` (datetime)
- `station` (string/category)
- `t_min` (float)
- `t_max` (float)
- `precip` (float)

### Typical Operations
1. **Column detection & renaming** – bring different column names to a shared set
2. **Type coercion** – numbers via `pd.to_numeric(errors='coerce')`, dates via `pd.to_datetime(errors='coerce')`
3. **Missing values** – `dropna` or `fillna` depending on context
4. **Duplicates** – `.duplicated()` + `.drop_duplicates()`
5. **Categoricals** – normalize text (`strip`, `title`, `upper`) and `astype('category')` if useful
6. **Validation** – quick assertions (e.g., date not null, temperature ranges plausible)
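
For the categorical step, here is a small sketch on `precip_type`-style values. The mapping of single letters to longer labels (`R` to `rain`, `S` to `snow`, `M` to `mixed`) is an assumption based on the values seen in the files, and `normalize_precip_type` is a hypothetical helper name.

```python
import pandas as pd

# Assumption: single-letter codes abbreviate the longer labels seen in the files
PRECIP_ALIASES = {"r": "rain", "s": "snow", "m": "mixed"}
MISSING = ["", "na", "—"]  # compared after lowercasing

def normalize_precip_type(values: pd.Series) -> pd.Series:
    """Hypothetical helper: trim, lowercase, unify missing markers, expand aliases."""
    text = values.astype("string").str.strip().str.lower()
    text = text.mask(text.isin(MISSING))  # missing markers become <NA>
    text = text.replace(PRECIP_ALIASES)   # exact-match replacement of whole values
    return text.astype("category")

raw = pd.Series(["R", "rain ", "S", "Mixed", "NA", "—", "", "none", None])
clean = normalize_precip_type(raw)
print(clean)
print(clean.value_counts())
```

Here `none` is kept as its own category, since "no precipitation" is information, while `NA`, `—` and empty strings are treated as unknown.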

### Step-by-Step Guide
1) **Pick one DataFrame** from `dfs_extra` and print `.head()`, `.columns`, `.info()`
2) **Map columns** to target names (e.g., `temp_min` → `t_min`)
3) **Coerce**:
   - `df['date'] = pd.to_datetime(df['date'], errors='coerce')`
   - `df[['t_min','t_max','precip']] = df[['t_min','t_max','precip']].apply(pd.to_numeric, errors='coerce')`
4) **Handle missing**: start conservative (e.g., drop rows missing `date` or all temperature columns)
5) **Standardize station names**: `df['station'] = df['station'].astype(str).str.strip().str.title()`
6) **Check duplicates** and remove
7) **Repeat** for all DataFrames
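
When one column mixes several date formats, a single `pd.to_datetime` call with one format will turn many rows into `NaT`. One way to handle this, sketched below, is to try a list of explicit formats in turn and keep the first match per row. The format list and the helper name `parse_mixed_dates` are assumptions; in particular, hyphenated dates are treated as month-first here, so an ambiguous value like `02-11-1925` would be read as 11 February.

```python
import pandas as pd

# Assumption: hyphenated dates such as 10-19-1925 are month-first
DATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d", "%Y.%m.%d", "%m-%d-%Y"]

def parse_mixed_dates(values: pd.Series) -> pd.Series:
    """Hypothetical helper: try each format, with and without HH:MM, keep the first match."""
    text = values.astype(str).str.strip()
    result = None
    for fmt in DATE_FORMATS:
        for full_fmt in (fmt, fmt + " %H:%M"):
            parsed = pd.to_datetime(text, format=full_fmt, errors="coerce")
            # rows already parsed keep their value; only NaT rows are filled
            result = parsed if result is None else result.fillna(parsed)
    return result

sample = pd.Series(["1925-10-19", "18.10.1925", "1925/06/25 07:00", "10-07-1925 14:00", "NA"])
dates = parse_mixed_dates(sample)
print(dates)
```

For truly ambiguous day/month values, extra context is needed, for example the months a station file is supposed to cover.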

### Common Pitfalls & Tips
- Treat ambiguous `-` or `NA` strings as missing (`na_values=["-","NA","N/A"]` if you re-read with `read_csv`)
- Some files might have **merged columns**; split using `.str.split(',', expand=True)` when necessary
- If a file lacks a column, create it with `pd.NA` so the schema lines up later
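
If the data is already in a DataFrame of strings, the missing markers can be unified before numeric conversion. The sketch below uses a small made-up frame; the marker list follows the dataset description and should be adjusted to what the files actually contain.

```python
import pandas as pd

MISSING_MARKERS = ["", "NA", "—", "-999"]

# Tiny made-up frame of raw strings, as produced by splitting text lines
raw = pd.DataFrame({
    "t_max_c": ["5.9", "-999", "NA", "8.4"],
    "t_min_c": ["5.5", "4.2", "—", ""],
})

# Step 1: blank out every cell that equals one of the markers
cleaned = raw.mask(raw.isin(MISSING_MARKERS))
# Step 2: convert what is left to numbers
numeric = cleaned.apply(pd.to_numeric, errors="coerce")

print(numeric)
print(numeric.isna().sum())
```

The order matters: `pd.to_numeric` alone would happily convert `-999` to a valid number, and the sentinel would then distort any temperature statistics.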

### 🧱 Skeleton: Inspect & Rename

In [None]:
# Example skeleton for one dataframe named df
# df = dfs_extra['some_file']
# print(df.head()); print(df.columns); df.info()

# rename_map = {
#     'Date': 'date', 'DATE':'date',
#     'Station':'station', 'City':'station',
#     'Tmin':'t_min', 'TminC':'t_min', 'Min':'t_min',
#     'Tmax':'t_max', 'TmaxC':'t_max', 'Max':'t_max',
#     'Precip':'precip', 'Rain':'precip'
# }
# df = df.rename(columns=lambda c: rename_map.get(str(c), str(c).strip().lower()))


### 🧱 Skeleton: Type Coercion & Missing Handling

In [None]:
# required_cols = ['date','station','t_min','t_max','precip']
# for c in required_cols:
#     if c not in df.columns:
#         df[c] = pd.NA

# df['date'] = pd.to_datetime(df['date'], errors='coerce')
# for c in ['t_min','t_max','precip']:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

# # Drop rows with no usable date
# df = df.dropna(subset=['date'])

# # Optional: fill precip missing with 0 if domain-appropriate
# # df['precip'] = df['precip'].fillna(0)

### 🧱 Skeleton: Text Normalization & Duplicates

In [None]:
# df['station'] = df['station'].astype(str).str.strip().str.title()
# before = len(df)
# df = df.drop_duplicates()
# print('Removed', before - len(df), 'duplicate rows')

### 🧪 Suggested Sanity Checks

In [None]:
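# assert df['date'].notna().all(), 'Null dates remain'
# # Optional plausibility check (adjust to real units)
# # comparisons with NaN are False, so compare only rows where both temperatures are present
# both = df.dropna(subset=['t_min', 't_max'])
# assert (both['t_min'] <= both['t_max']).all(), 'Found t_min > t_max'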

## Part 4: Merging Cleaned DataFrames

### Goal
Combine all standardized DataFrames into **one big DataFrame** with a **unified column structure**.

### Strategy
1. **Define the target schema** used in Part 3.
2. **Align each DataFrame** to the schema (add missing columns, reorder).
3. **Concatenate** with `pd.concat`.
4. **Final cleanup**: deduplicate, reindex, and sort by date/station.
5. **Save outputs** (`CSV` or `Parquet`) for Day 3 (EDA).

### Integration Checklist
- All DataFrames have columns: `date, station, t_min, t_max, precip`
- Dtypes are consistent across DataFrames
- No catastrophic loss of rows during coercion
- Final row count equals the sum of inputs minus duplicates
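
The last checklist item can be verified with a few lines. The sketch below uses two tiny made-up frames keyed by station name; the values and the `frames` dictionary are illustrative assumptions, not the real station data.

```python
import pandas as pd

# Made-up, already standardized frames; the Alūksne frame contains one repeated row
frames = {
    "Alūksne": pd.DataFrame({
        "date": pd.to_datetime(["1925-10-01", "1925-10-02", "1925-10-02"]),
        "t_min": [3.3, 4.2, 4.2],
        "t_max": [8.4, 9.1, 9.1],
    }),
    "Liepāja": pd.DataFrame({
        "date": pd.to_datetime(["1925-04-16"]),
        "t_min": [1.5],
        "t_max": [7.4],
    }),
}

# Tag each frame with its station before stacking them
tagged = [df.assign(station=name) for name, df in frames.items()]
big = pd.concat(tagged, ignore_index=True)

n_inputs = sum(len(df) for df in frames.values())
n_dupes = int(big.duplicated().sum())  # rows identical to an earlier row
big = big.drop_duplicates().reset_index(drop=True)

assert len(big) == n_inputs - n_dupes, "Unexpected row loss during merge"
print(f"{n_inputs} input rows, {n_dupes} duplicates removed, {len(big)} rows kept")
```

Adding the `station` column before concatenation matters: without it, identical readings from two different stations would be counted as duplicates.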

### 🧱 Skeleton: Alignment & Concatenation

In [None]:
# Suppose you have a dict of cleaned dfs: dfs_clean
# target_cols = ['date','station','t_min','t_max','precip']

# def coerce_to_schema(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
#     for c in cols:
#         if c not in df.columns:
#             df[c] = pd.NA
#     # Reorder and drop extras for now
#     return df[cols]

# aligned = [coerce_to_schema(d.copy(), target_cols) for d in dfs_clean.values()]
# big = pd.concat(aligned, axis=0, ignore_index=True)
# # sort before resetting the index so the final index follows the sorted order
# big = big.drop_duplicates().sort_values(['date','station']).reset_index(drop=True)
# big.head()

### 🧾 Export for Day 3

In [None]:
# out_dir = Path('outputs'); out_dir.mkdir(exist_ok=True)
# big.to_csv(out_dir / 'latvia_meteo_1925_cleaned_merged.csv', index=False)
# # Optional: Parquet for speed/size
# # big.to_parquet(out_dir / 'latvia_meteo_1925_cleaned_merged.parquet', index=False)

## 🔄 Reflection
- What kinds of messiness were easier to fix with **Python basics**?
- What kinds of messiness required **pandas**?
- What are the risks of “over-cleaning” or discarding too much data?