# 📅 Day 2 — Data Cleaning & Preparation

*RTU Data Analysis & Visualization CPD course*

**📚 Instruction (3h)**  
- 🧹 Handling missing values  
- 🗑 Removing duplicates  
- 🔄 Data type conversion  
- 📅 Parsing dates  
- 🏗 Feature engineering basics  
- 🔗 Combining datasets  
- 🏷 Intro to categorical encoding  

**🛠 Practical (1h)**  
- 🧽 Clean a messy dataset  
- 🔀 Merge with a secondary dataset  

**🔄 Reflection (1h)**  
- 🧐 Review: common pitfalls in cleaning  
- 💬 Discuss real-world cleaning challenges  
- 📝 Recap exercise: identify cleaning steps for a small example dataset

## 🎯 Goals for the Day
- Strengthen Python basics (functions, loops, if/else, file handling)
- Learn to process raw messy text files into usable form
- Apply pandas methods to clean incomplete/messy data
- Merge multiple datasets into a single unified dataframe

## 💡 Motivation / Explanation

### Introduction to Data Cleaning and Preparation

- **Why cleaning is critical before analysis**
  - Raw data is almost never ready for direct analysis
  - Errors, inconsistencies, and missing information can distort results
  - Proper cleaning ensures reliability, reproducibility, and trust in analysis outcomes

- **Real-world examples of messy data**
  - 🧹 Handling missing values - Weather records with missing timestamps or corrupt values
  - 🗳️ Survey responses with inconsistent categories (e.g., "Male", "male", "M")
  - 🗑 Removing duplicates - Financial transactions with duplicate entries
  - 🗑 Log files with noise lines, system messages, or broken encodings
  - 📅 Parsing dates - Event logs with inconsistent timestamp formats
  - 🔄 Data type conversion - User age recorded as text instead of numbers

> Think of data cleaning as *“washing vegetables before cooking”* — not exciting, but essential for a good meal.


## Weather Dataset: `latvia_meteo_1925_messy.zip`

Let us imagine we are helping Toms Bricis with weather data analysis for the year 1925. We have come across a bundle of messy text files that require cleaning and preparation.

- **What it is:** a bundle of **five “messy” text files** (≈50 rows each) simulating daily measurements from Latvian stations in **1925**:
  - **Rīga-University** — *Period 1 (Jan–Mar)*
  - **Rīga-University** — *Period 2 (Sep–Nov)*
  - **Liepāja** — *Apr–Jun*
  - **Mērsrags** — *Feb–May*
  - **Alūksne** — *Oct–Dec*

- **Columns present (but order varies by file):**  
  `date`, `t_max_c`, `t_min_c`, `precip_24h_mm`, `precip_type`, `present_weather_code`, `notes`

- **Deliberate “messiness” to practice cleaning:**
  - **Different separators:** `;`, `,`, `|`, and **TAB** (documented in each file’s `# fields=` header).
  - **Mixed column order** across files (use the header to map columns).
  - **Date formats vary** (`YYYY-MM-DD`, `DD.MM.YYYY`, `YYYY/MM/DD`, `DD-MM-YYYY`, `MM-DD-YYYY`, `YYYY.MM.DD`) and sometimes **include a time** (e.g., `07:00`).
  - **Numeric quirks:** decimal **commas** (e.g., `0,6`), **units** in strings (e.g., `0.8 mm`), and the Latvian word **“nulle”** for zero.
  - **Missing values** sprinkled in as `""`, `NA`, `—`, `-999`.
  - **Codes as strings** with possible leading zeros (e.g., `present_weather_code = "05"`).
  - **Free-text `notes`** in Latvian from the station master (may be blank/missing).

- **Intended skills to practice (Day 2):**
  - Detect & use **separators/column order** from headers.
  - **Parse heterogeneous dates** (with optional times).
  - Normalize **numerics/units** (decimal commas, `mm`, worded zeros).
  - Unify **missing values** and enforce **types** (e.g., cast weather codes to integers).
  - Keep useful **categorical text** (`precip_type`, `notes`) intact.

As part of our workflow we will want to verify whether the above descriptions of messiness hold true for our specific dataset files. This will help us tailor our cleaning approach effectively.


## Part 1: Python Fundamentals for Data Cleaning

For our first part we will use basic Python programming skills to explore and understand the dataset structure before diving into the cleaning process.

### 🔑 Key Idea - Loops Go Brrr

One of key advantages of programming is that we can automate repetitive tasks using loops. This is especially useful when working with datasets, as it allows us to apply the same operations to multiple rows or files without having to write redundant code.

Similarly loops let us figure out an approach that works for a single file and then easily adapt it to others.


### Getting ready for work

Typically in a finished notebook (and also normal scripts / programs), we want to start with a clear setup phase. This includes:

1. **Importing Libraries:** Load all necessary libraries at the beginning.
2. **Setting Up Paths:** Define file paths and other constants.
3. **Configuring Options:** Set any options or preferences (e.g., display settings).

By organizing our code this way, we make it easier to understand and modify later on.


In [1]:
# usually we start with general Python imports
from pathlib import Path # for file and file path related tasks
import sys, platform, os, io, shutil, zipfile, re # system related tasks
from datetime import datetime # this is because datetime module has datetime class
# otherwise if we used import datetime
# we would have to write datetime.datetime for everything
# first datetime
print(f"Today : {datetime.now().isoformat(timespec='seconds')}")
# now Python version
print(f"Python : {sys.version}")

# then we import external libraries
# external - not part of Python installation
# on Google Colab those are already installed
try:
    import pandas as pd # we could fail at import, then next line would not run
    print('pandas:', pd.__version__)
except ImportError:
    print(f"pandas not installed. Install with `pip install pandas`.")
    # for excel support extra instructions
    print(f"Install `openpyxl` for Excel support with `pip install openpyxl`.")
# requests is a widely used network library that makes internet "requests" easier
try:
    import requests # popular web requests library, for scraping, downloading web resources etc
except ImportError:
    requests = None
    print('requests not installed. Install with `pip install requests`.')


# we can also print out what type of environment we are running in, this could show OS information
print('Runtime:', platform.platform())
# We could show system RAM and free RAM but that would require either a non standard library
# or we would have to write some extra functions we skip this for now
# you can ask LLM to write these functions for you
# alternatively there are external libraries like psutil that do this out of the box

# Let us show our current drive space
print(f"Total Current Drive Space: {shutil.disk_usage('/').total / (1024**3):.2f} GB")
print(f"Free Current Drive Space: {shutil.disk_usage('/').free / (1024**3):.2f} GB")

# Current Working Directory
print(f"Current Working Directory: {Path.cwd()}") # more important in local computer
# in Google Colab this should give you /content
# note that in some cases you might not want to provide all this information to the public, if you have a super secret computer...


Today : 2025-09-11T15:07:01
Python : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
pandas: 2.2.2
Runtime: Linux-6.1.123+-x86_64-with-glibc2.35
Total Current Drive Space: 107.72 GB
Free Current Drive Space: 68.73 GB
Current Working Directory: /content


### 🧑‍💻 Functions

### What is the idea behind functions in programming?

In programming, a function is a block of reusable code that performs a specific task. Think of it like a miniature program within your main program. Functions are designed to:

- **Break down complex problems:** Large problems can be divided into smaller, manageable parts, each handled by a function. This makes code easier to write, understand, and debug.
- **Avoid repetition (DRY principle - Don't Repeat Yourself):** If you need to perform the same set of actions multiple times, you can define a function once and call it whenever needed, rather than writing the same code repeatedly.
- **Improve code organization and readability:** Functions group related code together, making the overall structure of your program clearer and easier to follow.
- **Enhance code reusability:** Once a function is defined, it can be used in different parts of the same program or even in other programs.
- **Simplify debugging:** If there's an issue, you can isolate the problem to a specific function, making it easier to find and fix the error.

In essence, functions are tools for modularity and abstraction in programming, allowing you to create more organized, efficient, and maintainable code.

### What is a function in Python?

In Python, a function is defined using the `def` keyword, followed by the function name, parentheses `()`, and a colon `:`. The code block within the function is indented. Functions can optionally take inputs called *arguments* (placed inside the parentheses) and can return a value using the `return` keyword.

Here's a basic structure of a Python function:

In [None]:
def greet(name: str = 'student') -> str:
    """
    Returns a friendly greeting for the given name.
    If no name is provided, defaults to 'student'.
    """
    # """ is so called Docstring a way in Python to include small help snippets
    # name is a variable local to this function
    # Use an f-string to insert the name into the greeting
    # note we are not printing the greeting here, just returning it for use by another part of the code
    return f"Hello, {name}!" # we can insert pretty much any type of data in f-strings

# also nothing should be happening besides the function being added to our memory
# we have not called the function yet

In [None]:
# Call the function and print the result
greeting = greet() # assign results of greet() function to greeting variable
print(greeting)

Hello, student!


In [None]:
# Now that I have this function I can make other greetings
my_greeting = greet("Valdis")
print(my_greeting)
numeric_greeting = greet(80800) # function expects str, but no penalty in this case
print(numeric_greeting)

Hello, Valdis!
Hello, 80800!


### Type Hints

Type hints in Python provide a way to indicate the expected data types of variables, function parameters, and return values. They help improve code readability and enable better static analysis by tools like linters and IDEs.

However, they have no actual power at runtime and are not enforced by the Python interpreter. They serve as a guideline for developers and can be checked using static type checkers like mypy.

I liken them to "documentation for your code." Just as documentation helps users understand how to use your code, type hints help developers understand what types of values are expected.



In [None]:
# let's see one more example with simple add function
def add(a: int, b: int) -> int:
    return a + b

# so this function expects only integers and returns integer

# however it will work with any values that support adding



In [None]:
print(add(2,2))
print(add(3.14, 2.71))
print(add("Valdis", " RTU"))

# so again type hints are useful (AI will happily make them for us)
# but in Python they only serve as a guiding light

4
5.85
Valdis RTU


In [None]:
# we can still get error if we try to add str with integer which is not allowed in Pythong...
# print(add("nevar", 66))

In [None]:
# let's make a multiplication function that takes two integers and returns an integer
def multiply(a: int, b: str = "x") -> int:
    return a * b

print(multiply(2,3))
print(multiply(3.1415926, 44444))
print(multiply("Beer ", 5)) # funnily Python lets us multiply Strings with integers
print("*"*80) # i pass * explicitly
print(multiply(20)) # here x is implicit as value for b because it is default

6
139624.9415144
Beer Beer Beer Beer Beer 
********************************************************************************
xxxxxxxxxxxxxxxxxxxx


### Function to download and unzip file from url

Below is a more complicated function that does two things: it downloads a zip file from a given URL and then extracts its contents to a specified directory.

Theoretically speaking it would be better to have two functions one that downloads the file and another that extracts it.

In general functions should do one thing and do it well.

When you need to do more than one thing you can combine them into a single function, but be mindful of keeping the function focused and not overly complex.


In [None]:
text = "A quick brown fox     jumped over a sleeping dog"
words = text.split() # split creates a list of strings from string
# by default it splits by any whitespace including newlines \n, tabs \t, and of course spaces
print(words)
comma_text = "Valdis, Līga, Maija, Rūta, Ede"
comma_words = comma_text.split(",")
print(comma_words)

['A', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'sleeping', 'dog']
['Valdis', ' Līga', ' Maija', ' Rūta', ' Ede']


In [None]:
def download_and_unzip(url: str, target_folder: str | Path = 'sample_data') -> Path:
    """
    Downloads a ZIP file from the given URL and extracts it to the target folder.
    Returns the path to the folder where files were extracted.
    """
    target = Path(target_folder)  # Make sure target is a Path object
    target.mkdir(parents=True, exist_ok=True)  # Create the folder if it doesn't exist
    filename = url.split('/')[-1]  # Get the file name from the URL
    # in case of URL the file name is the last one (so index -1 means last one in a list)
    # next we create full path where we will save the zip file
    zip_path = target / filename  # Full path to save the ZIP file
    # check if library exists
    if requests is None:
        raise RuntimeError('requests required.')
    # Download the file in chunks (good for large files)
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()  # Raise an error if download failed
        with open(zip_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    # Unzip the downloaded file
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall(target) # this command given a zip file as target will unzip ALL files
    # Optionally, remove the ZIP file after extraction
    # zip_path.unlink(missing_ok=True)  # Remove the ZIP file similar to moving to recycle bin
    return target  # Return the folder where files were extracted


**Practice dataset for Part 1:** `latvia_meteo_1925_messy.zip` (5 text files)

- URL: https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_messy.zip

In [None]:
# let's download the messy files zip
url = "https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_messy.zip"
print(f"Will download and unzip from following url: {url}")
# let's download the file and extract it under day_2_data
download_and_unzip(url, Path("day_2_data"))

Will download and unzip from following url: https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_messy.zip


PosixPath('day_2_data')

## Exploring Folder content

In [None]:
text_files = sorted(Path("day_2_data").glob("*.txt")) # so I look for all files that end with txt extension
print(text_files)
# so glob is kind of like file search in Windows Explorere
for p in text_files: # so we iterate/loop over all file paths, calling each one with name p - i could have used a longer more descriptive name such as file_name, my_file, file etc etc
    print(p)

[PosixPath('day_2_data/aluksne_1925.txt'), PosixPath('day_2_data/liepaja_1925.txt'), PosixPath('day_2_data/mersrags_1925.txt'), PosixPath('day_2_data/riga_university_1925_p1.txt'), PosixPath('day_2_data/riga_university_1925_p2.txt')]
day_2_data/aluksne_1925.txt
day_2_data/liepaja_1925.txt
day_2_data/mersrags_1925.txt
day_2_data/riga_university_1925_p1.txt
day_2_data/riga_university_1925_p2.txt


In [None]:
# let's add Google Colab specific code that will offer download to your computer of all the files extracted
# this will not work locally since you already have these files locally!
from google.colab import files # this is Google Colab specific
# if you were local you would already have these files locally
# so we will loop through all files with txt extension in day_2_data and download them
for p in sorted(Path("day_2_data").glob("*.txt")): # glob looks in current folder
    files.download(p) # so this command will run for each file path - we call it p here

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### 📂 File Handling

First let me show you how to read whole text file into one big text string

In [None]:
# let's make a function that takes a file Path or string and optional encoding with default utf-8 and returns text string
def get_file_contents(path: Path | str, encoding: str = 'utf-8') -> str:
    """
    Reads the contents of a text file and returns it as a string.
    """
    # so path could be string or Path
    # file could have any extension but it should contain text of some sort
    with open(path, mode='r', encoding=encoding) as f:
        # file stream is open here under f variable which has type IOStream(sic)
        content = f.read() # so we came up with name content for our inner variable
        # file is still open even though we have moved the reading "head" to the end
    # here file is automatically closed after the with block
    return content

In [None]:
# let's read aluksne into memory
# aluksne = get_file_contents(Path("day_2_data/aluksne_1925.txt"))
aluksne = get_file_contents("day_2_data/aluksne_1925.txt")
print(aluksne[:300]) # let's print first 300 characters
# note above is same as print(aluksne[0:300]) # 0 is default

# station_name=Alūksne
# period=(Oct–Dec)
# separator_hint=|
# columns_in_this_file= notes | present_weather_code | t_max_c | precip_24h_mm | date | precip_type | t_min_c
# note: some values intentionally messy (units, words, missing, time in date)
# fields=notes|present_weather_code|t_max_c|precip_


In [None]:
# print last 200 characters
print(aluksne[-200:]) # Python offers two types of indexing
# positive from 0 to len(iterable)-1
# negative from -len(iterable) to -1

0.2 mm|12-18-1925|snow|-7.6
Pūtis brāzmas|82|5.3||11-18-1925||-0.4
NA|63|3.4||12-19-1925|M|-6.0
Novērojums vēlāk|90|-0.2|2.3|12-22-1925|mixed|-8.2
Daļēji mākoņains|90|-3.5|0.2 mm|12-16-1925|rain|-4.0



![Python indexing](https://developers.google.com/static/edu/python/images/hello.png)

## Simple string methods

In [None]:
# we can get count of word "brāzmas" in content
aluksne.count("brāzmas") # count is a string method

4

In [None]:
# we could ask if rain is in aluksne
# this is so called existence check
"rain" in aluksne # so ancient Aluksnians wrote in English... :)

True

In [None]:
# we could replace some text in this content
# however strings are immutable - unchangable
# so to change we would have to overwrite or have new variable
# let's have new text with replacement of Mērījums apstiprināts to Viss OK
aluksne_ok = aluksne.replace("Mērījums apstiprināts", "Viss OK")
# Note replace DOES NOT modify original text
# we have to supply a new variable to save this changed text
# if I wanted aluksne modified I would have written as follows
# aluksne = aluksne.replace("Mērījums apstiprināts", "Viss OK")
print(aluksne_ok)

# station_name=Alūksne
# period=(Oct–Dec)
# separator_hint=|
# columns_in_this_file= notes | present_weather_code | t_max_c | precip_24h_mm | date | precip_type | t_min_c
# note: some values intentionally messy (units, words, missing, time in date)
# fields=notes|present_weather_code|t_max_c|precip_24h_mm|date|precip_type|t_min_c
—|53|5.9|3.0 mm|10-19-1925|R|5.5
Rīts auksts|-999|9.1|2.3 mm|18.10.1925|mixed|4.2
Daļēji mākoņains|65|5.4|0.0|11-10-1925||-0.5
Daļēji mākoņains|50|15.6|0.5|10-07-1925 14:00|S|10.4
Neliels vējš|63|-1.1|0.0|12-28-1925|—|—
Vēlāk sāka līt|-999|5.3||12-07-1925 19:00|none|
Pūtis brāzmas|82|4.2|0.6 mm|11-07-1925|S|0.1
—|61|-7.2|1.4|12-03-1925|mixed|-6.5
Laikam migla|51|3.3|1.0 mm|11-27-1925|mixed|-1.3
Mērīšanas kļūda?|70|-3.5999999999999996|2.1|12-06-1925|M|-2.9
Skursteņi kūp|71|6.2|0.0|10-21-1925||6.9
NA|61|1.3|2.1|28.11.1925|NA|2.0
Ap pusdienlaiku saule|80|8.4|1.9 mm|02-11-1925|snow|-1.5
|53|8.4||10-01-1925 07:00||3.3
Viss OK|53|10.8|0.7 mm|10-08-1925|S|5.3
Rīts au

### Iterating over file line at a time

Above example showed how we could read a whole file into memory.
However that means we would be working with file as one big string as a whole. We could do some replace operations.

Much more often we will want to work on file one line(row) at a time.

### 🔄 For Loops

For loops in Python let us iterate over a sequence (like a list, tuple, or string, or other iterables such as lines in a file) and perform an action for each item in that sequence.

General syntax of for loops is:

```Python
for element in <iterable>:
    <action>
    <more optional action>
```

Note the indentation, as usual in Python after : we have indentation to indicate the block of code that belongs to the for loop.

In [None]:
def count_lines(path: Path | str) -> int:
    """
    Counts the total number of lines and the number of non-empty lines in a file.
    Returns a tuple: (total_lines, nonempty_lines)
    """
    total = 0 # we start our counter with 0, very common pattern in software development
    # this total is local to the count_lines function
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f: # so f is a filestream with this for loop we go through it line at a time
            total += 1 # remember this is same as total = total + 1
            # we could do something more here with the line but here we just count the contents
    # important that file is here closed automatically
    return total # Return the line count

In [None]:
# let's count lines in aluksne_1925
count_lines(Path("day_2_data/aluksne_1925.txt"))

42

In [None]:
# Path is not strictly necessary , Path is used to avoid cross OS errors between Mac,Windows, Linux
count_lines("day_2_data/aluksne_1925.txt")

42

In [None]:
# let's count mersrags
count_lines("day_2_data/mersrags_1925.txt")

54

In [None]:
# let's write a function that takes a file path for text file
# second paramater will be destination folder where to write cleaned file
# we want to go line by line and save only lines that starts_with # as default
def clean_file(file, dst, prefix = "#", encoding="utf-8"):
    # first let's convert file to Path just in case
    file = Path(file)
    with open(file, mode='r', encoding=encoding, errors='replace') as f:
      # let's create dst folder if it does not exist
      dst = Path(dst) # just in case
      # now we can make folder
      dst.mkdir(exist_ok=True) # if folder exists do nothing
      # let's define save file as dst plus file.stem plus cleaned plus .txt
      dst_file = dst / (file.stem + "_cleaned.txt") # dst is Path object
      with open(dst_file, mode='w', encoding=encoding, errors='replace') as g: # this file does not have to exist, it will be created automatically
            for line in f: # so we go row by row in original file
              if line.startswith(prefix): # so prefix could be anything you want
              # you can adjust logic in above line to whatever fits your use case
                g.write(line)


In [None]:
# so let's try saving some data from aluksne_1925.txt
clean_file("day_2_data/aluksne_1925.txt", "day_2_cleaned")

## Determining the separator character

Our files have helpfully provided a separator hint without which it would be very hard to determine.

`# separator_hint=|`
`# separator_hint=TAB`
`# separator_hint=,`

Let's write a function that extracts hint from the file


In [None]:
# "Alice Bob     Carol   ".split("Bob")[-1]
"Alice Bob     Carol   ".split("Bob")[-1].strip()

'Carol'

In [None]:
# let's use regular expression to extract any letters after Bob
# in string such as "Alice Bob     Carol   " I want to return Carol only
# import re # regular expressions
# haystack = "Alice Bob     Carol   "
# needle = "Bob"
# my_match = re.search(needle, haystack)
# print(my_match)
# TODO find a matching group

In [None]:
def get_sep(path: Path) -> str:
    """
    We look for first line that contains
    `# separator_hint=|`
    `# separator_hint=TAB`
    `# separator_hint=,`
    """
    needle = "separator_hint="
    sep = None # we start with assumption that we do not know the separator
    # lets open file and go through line by line
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f: # so go row by row
            if needle in line: # we check for presence of needle
            # we could have used regular expression but no need here
                # let's find whatever is after needle
                # so we split by needle at take last part
                sep = line.split(needle)[-1].strip() # strip removes all whitespace from both sides of string
    # file is automatically closed here
    # if sep is TAB we need to return \t
    if sep == "TAB": # special case TAB which we need to convert
        return "\t" # so \t is a way of representing tab in Python and many other languages
        # similarly \n is a newline symbol
        # so then \\ is \ symbol
    return sep

In [None]:
# let's see if we can find separator for aluksne_1925
get_sep(Path("day_2_data/aluksne_1925.txt"))

'|'

In [None]:
# let's find it for all txt files
for p in sorted(Path("day_2_data").glob("*.txt")):

    # we could have used bad as well
    print(p, get_sep(p))

day_2_data/aluksne_1925.txt |
day_2_data/liepaja_1925.txt ,
day_2_data/mersrags_1925.txt 	
day_2_data/riga_university_1925_p1.txt ;
day_2_data/riga_university_1925_p2.txt ;


## glob vs. rglob

Using Path I can use glob or rglob to find files matching my pattern

glob looks in specific directory only

rglob looks in specific directory and ALL subfolders and their subfolders and so on forever

rglob is quite similar to how Windows Explorer looks for files

In [None]:
# let's use rglob to look for all text files in current folder and its subfolders and so on
text_files = sorted(Path(".").rglob("*.txt")) # . means current folder
# rglob actually returns and iterator not list
# sorted returns a sorted by name list
for p in text_files:
    print(p)

day_2_cleaned/aluksne_1925_cleaned.txt
day_2_data/aluksne_1925.txt
day_2_data/liepaja_1925.txt
day_2_data/mersrags_1925.txt
day_2_data/riga_university_1925_p1.txt
day_2_data/riga_university_1925_p2.txt


In [None]:
# now let's find all csv files
csv_files = sorted(Path(".").rglob("*.csv"))
for p in csv_files:
    print(p)

sample_data/california_housing_test.csv
sample_data/california_housing_train.csv
sample_data/mnist_test.csv
sample_data/mnist_train_small.csv


In [None]:
# note difference if we gave the actual subfolder
txt_files = sorted(Path("day_2_data").rglob("*.txt"))
for p in txt_files:
    print(p)

day_2_data/aluksne_1925.txt
day_2_data/liepaja_1925.txt
day_2_data/mersrags_1925.txt
day_2_data/riga_university_1925_p1.txt
day_2_data/riga_university_1925_p2.txt


## Hand loading a DataFrame

For single cases we could do okay by hand, meaning we use
Pandas read_csv and apply some parameters to skip some rows, use custom separator etc.

In [None]:
# so let's load aluksne
# we know the following it uses | as separator
# first 6 rows contain metadata that we want to skip
# we want column names to be generic
aluksne_df = pd.read_csv(Path("day_2_data/aluksne_1925.txt"),
                         sep="|",
                         skiprows=6,
                         header=None)
aluksne_df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,—,53.0,5.9,3.0 mm,10-19-1925,R,5.5
1,Rīts auksts,-999.0,9.1,2.3 mm,18.10.1925,mixed,4.2
2,Daļēji mākoņains,65.0,5.4,0.0,11-10-1925,,-0.5
3,Daļēji mākoņains,50.0,15.6,0.5,10-07-1925 14:00,S,10.4
4,Neliels vējš,63.0,-1.1,0.0,12-28-1925,—,—


## Supported Text Encoding by Python

Usually we want utf-8, but you might have some older text files that you might want to process.

* https://docs.python.org/3/library/codecs.html#text-encodings

In [None]:
# now let's read row 6 from aluksne_1925.txt, it contains needle "fields="
# we want to split by this needle
# then the last part we want to split by pipe |
# those will be the column names
with open(Path("day_2_data/aluksne_1925.txt"), 'r', encoding='utf-8', errors='replace') as f:
    for line in f:
        if "fields=" in line:
            column_names = line.split("fields=")[-1].strip().split("|") # here we know the separator | but in general we need to find sep first
            break # we assume there will be no more fields= we already got our column_names

column_names

['notes',
 'present_weather_code',
 't_max_c',
 'precip_24h_mm',
 'date',
 'precip_type',
 't_min_c']

In [None]:
# let's check length of df columns and how many column_names we have
print(len(aluksne_df.columns))
print(len(column_names))
# print ready for assignment message if lengths match
if len(aluksne_df.columns) == len(column_names):
    print("Lengths match") # we are good to go here!
else:
    print("Lengths do not match") # we need to think why thye did not match

7
7
Lengths match


In [None]:
# now let's apply those column names to aluksne_df
# only requirement that lengths of columns match,
# that is we have same number of columns
aluksne_df.columns = column_names
aluksne_df.head()

Unnamed: 0,notes,present_weather_code,t_max_c,precip_24h_mm,date,precip_type,t_min_c
0,—,53.0,5.9,3.0 mm,10-19-1925,R,5.5
1,Rīts auksts,-999.0,9.1,2.3 mm,18.10.1925,mixed,4.2
2,Daļēji mākoņains,65.0,5.4,0.0,11-10-1925,,-0.5
3,Daļēji mākoņains,50.0,15.6,0.5,10-07-1925 14:00,S,10.4
4,Neliels vējš,63.0,-1.1,0.0,12-28-1925,—,—


### 📥 Loading Cleaned Files into DataFrames

In [None]:
# Let us write a function that will take a file path
# This function will go through file line by line
# If line with separator_hint= is found we save in sep value the last part of this line after stripping whitespace
# any other line that starts with hash (#) is ignored
# all other lines are split using sep value and stored as a list of lists (2d)
# once all lines are read the 2d list is converted to Dataframe and returned
def load_messy_file(path: Path | str) -> pd.DataFrame:
    sep = None
    lines = [] # empty list (saraksts Latviski), array(masīvs) in other languages
    # we will use lines to pass later to DataFrame constructor
    columns = []
    # with guarantees closing of file at the end
    with open(path, mode='r', encoding='utf-8', errors='replace') as f:
        # we loop through file one row at a time
        for line in f:
            if line.startswith('# separator_hint='):
                # extract separator by splitting by hint and then taking last item
                sep = line.split('separator_hint=')[-1].strip()
                # add special case TAB -> \t
                if sep == "TAB":
                    sep = "\t"
            if "# fields=" in line:
              # we split by "fields=" then split last part by sep giving us columns
              raw_fields = line.split("fields=")[-1].strip() # we take last part
              columns = raw_fields.split(sep) # if we split by None we will get error
            if line.startswith("#"): # order is important we check this AFTER hint check
                continue # we go to next line
            # this means here we have a good line that should be usable for data
            if sep:
                # we split the row by separator and add it to our list of lines
                lines.append(line.split(sep)) # we create a new list
                # and append this new list to our existing
                # in effect we are creating a 2D matrix - 2d list here
    # Here file is closed - that's good and safe!
    # all that remains is to create a dataframe from our lines
    df = pd.DataFrame(lines) # we leave the challenge of creating appropriate names for later
    # let's check if column count in df corresponds to length of our columns
    if len(columns) == len(df.columns):
        # if it matches we apply our columns
        df.columns = columns
    # so above idea is to use our columns only when our shape matches
    return df

In [None]:
# let's try above function with aluksne_1925.txt
df = load_messy_file(Path("day_2_data/aluksne_1925.txt"))
df.head()

Unnamed: 0,notes,present_weather_code,t_max_c,precip_24h_mm,date,precip_type,t_min_c
0,—,53,5.9,3.0 mm,10-19-1925,R,5.5\n
1,Rīts auksts,-999,9.1,2.3 mm,18.10.1925,mixed,4.2\n
2,Daļēji mākoņains,65,5.4,0.0,11-10-1925,,-0.5\n
3,Daļēji mākoņains,50,15.6,0.5,10-07-1925 14:00,S,10.4\n
4,Neliels vējš,63,-1.1,0.0,12-28-1925,—,—\n


In [None]:
text_files = sorted(Path("day_2_data").glob("*.txt"))
text_files

[PosixPath('day_2_data/aluksne_1925.txt'),
 PosixPath('day_2_data/liepaja_1925.txt'),
 PosixPath('day_2_data/mersrags_1925.txt'),
 PosixPath('day_2_data/riga_university_1925_p1.txt'),
 PosixPath('day_2_data/riga_university_1925_p2.txt')]

In [None]:
df_dictionary = {} # dictionary is a data structure in Python
# other languages call this hashtable, associative array
# idea is that from unique key we can find very quickly the value
# which uses key -> value, so unique keys point to any type of value
# typically keys are strings, but technically any hashable key can be used
for file in text_files:
    df_dictionary[file.stem] = load_messy_file(file) # so we assign to each stem of file name a DataFrame created from that file
# how many dataframes
print(len(df_dictionary))

5


In [None]:
# let's go through all dataframes, print the file name and display head(3)
for k, v in df_dictionary.items(): # items gives us iterable tuples of key and value commonly we use k,v for short
# here k == file_name stem (no extension)
# v == dataframe
    print(k)
    display(v.head(3))

aluksne_1925


Unnamed: 0,notes,present_weather_code,t_max_c,precip_24h_mm,date,precip_type,t_min_c
0,—,53,5.9,3.0 mm,10-19-1925,R,5.5\n
1,Rīts auksts,-999,9.1,2.3 mm,18.10.1925,mixed,4.2\n
2,Daļēji mākoņains,65,5.4,0.0,11-10-1925,,-0.5\n


liepaja_1925


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,13,2,1925/05/26,none,75,0,0 mm,-999,Mērīšanas kļūda?\n,
1,24,4,07.06.1925,rain,75,0,6,16,8,Neliels vējš\n
2,12,600000000000001,1925/05/20,S,50,0,4 mm,13,3,Pūtis brāzmas\n


mersrags_1925


Unnamed: 0,date,precip_24h_mm,precip_type,notes,t_min_c,present_weather_code,t_max_c
0,15-03-1925,0.0,none,Mērījums apstiprināts,-3.7,70,-4.4\n
1,04-04-1925 14:00,0.6,snow,Ap pusdienlaiku saule,4.7,75,7.9\n
2,22-05-1925,0.1,,,7.7,80,11.9\n


riga_university_1925_p1


Unnamed: 0,date,t_min_c,t_max_c,present_weather_code,precip_type,precip_24h_mm,notes
0,1925-02-11,—,1.3,73,,0.0,—\n
1,1925-01-06,-4.7,0.8,90,none,0.1 mm,Mērīšanas kļūda?\n
2,1925-02-07,-8.3,-7.2,71,none,0.0 mm,Rīts auksts\n


riga_university_1925_p2


Unnamed: 0,present_weather_code,date,precip_24h_mm,t_max_c,t_min_c,precip_type,notes
0,53.0,18.10.1925 20:00,0.7,10.5,6.0,snow,—\n
1,,1925/10/13,0.0,9.7,8.1,—,\n
2,50.0,16.09.1925,1.3,17.3,14.8,rain,Mērīšanas kļūda?\n


In [None]:
# okay let's write a function that takes src folder containing text files
# function takes also dst folder for output of xlsx files that will be obtained by saving Dataframes from text files
def create_xlsx_files(src: Path | str, dst: Path | str):
    # first read text files in src
    src = Path(src) # to make sure that src is actually Path
    # using Path objects lets us use glob below
    text_files = sorted(src.glob('*.txt'))

    # create dst if it does not exist
    dst = Path(dst)
    dst.mkdir(exist_ok=True)
    # now simply loop through text_files and create Dataframe and save in dst as text file with xlsx suffix
    for p in text_files:
        df = load_messy_file(p) # note this df is local to this function, outside df is not affected
        # df.to_excel(dst / (p.stem + ".xlsx"))
        # we want to save without index
        df.to_excel(dst / (p.stem + ".xlsx"), index=False)

### Saving converted files into XLSX

Now that we have needed functions we can use them to convert text files to XLSX format.

Of course the text files should match the specific format (have separator hint and naturally have lines of content separated by that separtor

In [None]:
# let'' take day_2_data text files
# and convert them to xlsx
# let's use new folder for that
# let's call it day_2_xlsx
INPUT_DIR = "day_2_data" # note ALL_CAPS is not required, it just shows that these are configurable variables
OUTPUT_DIR = "day_2_xlsx"

create_xlsx_files(INPUT_DIR, OUTPUT_DIR)

In [None]:
# let us download all files from OUTPU_DIR
from google.colab import files
for p in sorted(Path(OUTPUT_DIR).glob("*.xlsx")):
    files.download(p)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Part 2: Guided Exercise — Latvia Weather Data (Extra Messy)

**Duration:** ~30 minutes  
**Dataset:** `latvia_meteo_1925_extra_messy.zip`  
**URL:** https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_extra_messy.zip

### 🎯 Objective
Convert multiple extra-messy weather text files   into XLSX files into a new folder of your choice

### Hint

Simply run the provided functions with folders of your choice.

### Advanced Users

Advanced users with some Pandas experience can work on checking for missing values and even try merging or concatenating the dataframes.

In [None]:
# --- SKELETON (students fill in) ---
EXTRA_URL = 'https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_extra_messy.zip'
DATA_DIR = Path('day_2_exercise')

# 1) Download & unzip
download_and_unzip(EXTRA_URL, DATA_DIR)

PosixPath('day_2_exercise')

In [None]:
# --- SKELETON (students fill in) ---
EXTRA_URL = 'https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/latvia_meteo_1925_extra_messy.zip'
DATA_DIR = Path('day_2_exercise')

# 1) Download & unzip
download_and_unzip(EXTRA_URL, DATA_DIR)

# # 2) Print contents of DATA DIR

data_files = sorted(DATA_DIR.glob('*.txt'))
for file_name in data_files: # file_names is just a variable name
  print(file_name)

# 2) Inspect: list files & counts
for p in sorted(DATA_DIR.glob('*.txt')):
    print(p.name, '->', count_lines(p))



day_2_exercise/dobele_1925.txt
day_2_exercise/mersrags_extra_1925.txt
day_2_exercise/pavilosta_1925.txt
dobele_1925.txt -> 54
mersrags_extra_1925.txt -> 52
pavilosta_1925.txt -> 56


In [None]:
# # let us test on day_2_exercise folder
INPUT_FOLDER = Path("day_2_exercise") # this should actually exist
OUTPUT_FOLDER = Path("day_2_exercise_output") # this can be any valid folder name
create_xlsx_files(INPUT_FOLDER, OUTPUT_FOLDER)

In [None]:
# let us download all files from OUTPUT_FOLDER
# from google.colab import files # not necessary if you already imported before
for p in sorted(Path(OUTPUT_FOLDER).glob("*.xlsx")):
    files.download(p)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
# let us create a function that takes downloads files from Google Colab to local computer
# this function takes two arguments
# first argument is folder where files reside
# second argument is extension that files should match
def download_from_colab(folder=".", extension=".xlsx"):
  for p in sorted(Path(folder).rglob(f"*{extension}")):
    files.download(p)

In [None]:
# now I will call this function without any arguments so will use all defaults
download_from_colab()

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### 🧪 Checkpoints
- At least **N≥3** cleaned files successfully load into DataFrames.
- No parsing exceptions on `.head()` or `.info()`.
- You can explain (in comments) which rules your `is_good_line` used.

### 🛠 Extension (Optional)
- Write a variant `clean_files(folder, out_dir=Path('data/cleaned'))` that writes outputs into a subfolder.
- Add a **regex-based** `is_good_line_regex` that only keeps lines starting with `YYYY-MM-DD`.

## Part 2b: Extracting column names automatically

We were able to extract the data from the files and load them into DataFrames and then save them into XLSX files.

One negative is that we do not have column names just numbers.

Thus we would like to extract column names automatically

## Part 3: Pandas-Specific Data Cleaning

Now we will focus on cleaning the data using pandas.



### Overview
In this section, you will standardize each DataFrame from Part 2 so they share a **common schema** and are ready to merge.

### Target Schema (example)
- `date` (datetime)
- `station` (string/category)
- `t_min` (float)
- `t_max` (float)
- `precip` (float)

### Typical Operations
1. **Column detection & renaming** – bring different column names to a shared set
2. **Type coercion** – numbers via `pd.to_numeric(errors='coerce')`, dates via `pd.to_datetime(errors='coerce')`
3. **Missing values** – `dropna` or `fillna` depending on context
4. **Duplicates** – `.duplicated()` + `.drop_duplicates()`
5. **Categoricals** – normalize text (`strip`, `title`, `upper`) and `astype('category')` if useful
6. **Validation** – quick assertions (e.g., date not null, temperature ranges plausible)

### Step-by-Step Guide
1) **Pick one DataFrame** from `dfs_extra` and print `.head()`, `.columns`, `.info()`
2) **Map columns** to target names (e.g., `temp_min` → `t_min`)
3) **Coerce**:
   - `date = pd.to_datetime(df['date'], errors='coerce')`
   - `df[['t_min','t_max','precip']] = df[['t_min','t_max','precip']].apply(pd.to_numeric, errors='coerce')`
4) **Handle missing**: start conservative (e.g., drop rows missing `date` or all temperature columns)
5) **Standardize station names**: `df['station'] = df['station'].astype(str).str.strip().str.title()`
6) **Check duplicates** and remove
7) **Repeat** for all DataFrames

### Common Pitfalls & Tips
- Treat ambiguous `-` or `NA` strings as missing (`na_values=["-","NA","N/A"]` if you re-read with `read_csv`)
- Some files might have **merged columns**; split using `.str.split(',', expand=True)` when necessary
- If a file lacks a column, create it with `pd.NA` so the schema lines up later

### 🧱 Skeleton: Inspect & Rename

In [None]:
# let us load xlsx files currently available
# xlsx_files = sorted(Path("day_2_xlsx").glob("*.xlsx"))
# i want ALL XLSX files currently available in all subfolders
xlsx_files = sorted(Path(".").rglob("*.xlsx"))
dfs_extra = {}
for p in xlsx_files:
    dfs_extra[p.stem] = pd.read_excel(p)


In [None]:
## let's see what names we have for our keys
print(dfs_extra.keys())

dict_keys(['dobele_1925', 'mersrags_extra_1925', 'pavilosta_1925', 'aluksne_1925', 'liepaja_1925', 'mersrags_1925', 'riga_university_1925_p1', 'riga_university_1925_p2'])


In [None]:
# how many key - value pairs?
print(len(dfs_extra.keys()))

8


In [None]:
# let us look at first df in dfs_extra
df = dfs_extra['aluksne_1925'] # this is how I get value from key
df.head()

Unnamed: 0,notes,present_weather_code,t_max_c,precip_24h_mm,date,precip_type,t_min_c
0,—,53.0,5.9,3.0 mm,10-19-1925,R,5.5\n
1,Rīts auksts,-999.0,9.1,2.3 mm,18.10.1925,mixed,4.2\n
2,Daļēji mākoņains,65.0,5.4,0.0,11-10-1925,,-0.5\n
3,Daļēji mākoņains,50.0,15.6,0.5,10-07-1925 14:00,S,10.4\n
4,Neliels vējš,63.0,-1.1,0.0,12-28-1925,—,—\n


In [None]:
# let's look at liepaja_1925'
df = dfs_extra['liepaja_1925']
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,13,2,1925/05/26,none,75,0,0 mm,-999,Mērīšanas kļūda?\n,
1,24,4,07.06.1925,rain,75,0,6,16,8,Neliels vējš\n
2,12,600000000000001,1925/05/20,S,50,0,4 mm,13,3,Pūtis brāzmas\n
3,15,8,1925/06/16,,73,0,2,10,8,Skursteņi kūp\n
4,21,9,1925/06/25 07:00,—,81,0,0 mm,15,4,NA\n


In [None]:
# let's remove liepaja_1925 from our dictionary
del dfs_extra['liepaja_1925'] # this drops the key-value pair
# how many keys we have now? should have 7
print(len(dfs_extra.keys()))

7


In [None]:
# let's check if all remaining dataframes have same columns (not necessarily in same order)
column_dictionary = {}
for k, v in dfs_extra.items():
    column_dictionary[k] = set(v.columns.tolist()) # set creates a collection of unique items from list or any other iterable
# keys
print(column_dictionary.keys())


dict_keys(['dobele_1925', 'mersrags_extra_1925', 'pavilosta_1925', 'aluksne_1925', 'mersrags_1925', 'riga_university_1925_p1', 'riga_university_1925_p2'])


In [None]:
# let's set first value in column_dictionary as reference and check if we have any differences
unique_columns = column_dictionary['dobele_1925']
for k, v in column_dictionary.items():
    # print differences for each value set with unique_columns
    print(k, unique_columns.difference(v)) # so we see what if any columns are different with reference set of columns

dobele_1925 set()
mersrags_extra_1925 set()
pavilosta_1925 set()
aluksne_1925 set()
mersrags_1925 set()
riga_university_1925_p1 set()
riga_university_1925_p2 set()


In [None]:
# so what are those unique_columns
# let's see them sorted
print(sorted(unique_columns))

['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code', 't_max_c', 't_min_c']


In [None]:
aluksne_df.head()

Unnamed: 0,notes,present_weather_code,t_max_c,precip_24h_mm,date,precip_type,t_min_c
0,—,53.0,5.9,3.0 mm,10-19-1925,R,5.5
1,Rīts auksts,-999.0,9.1,2.3 mm,18.10.1925,mixed,4.2
2,Daļēji mākoņains,65.0,5.4,0.0,11-10-1925,,-0.5
3,Daļēji mākoņains,50.0,15.6,0.5,10-07-1925 14:00,S,10.4
4,Neliels vējš,63.0,-1.1,0.0,12-28-1925,—,—


In [None]:
aluksne_df['veikals'] = "RIMI" # all cells in column veikals will have value RIMI
aluksne_df.head()

Unnamed: 0,notes,present_weather_code,t_max_c,precip_24h_mm,date,precip_type,t_min_c,veikals
0,—,53.0,5.9,3.0 mm,10-19-1925,R,5.5,RIMI
1,Rīts auksts,-999.0,9.1,2.3 mm,18.10.1925,mixed,4.2,RIMI
2,Daļēji mākoņains,65.0,5.4,0.0,11-10-1925,,-0.5,RIMI
3,Daļēji mākoņains,50.0,15.6,0.5,10-07-1925 14:00,S,10.4,RIMI
4,Neliels vējš,63.0,-1.1,0.0,12-28-1925,—,—,RIMI


In [None]:
# so now let's drop this column
aluksne_df = aluksne_df.drop(columns=['veikals'])
aluksne_df.head()

Unnamed: 0,notes,present_weather_code,t_max_c,precip_24h_mm,date,precip_type,t_min_c
0,—,53.0,5.9,3.0 mm,10-19-1925,R,5.5
1,Rīts auksts,-999.0,9.1,2.3 mm,18.10.1925,mixed,4.2
2,Daļēji mākoņains,65.0,5.4,0.0,11-10-1925,,-0.5
3,Daļēji mākoņains,50.0,15.6,0.5,10-07-1925 14:00,S,10.4
4,Neliels vējš,63.0,-1.1,0.0,12-28-1925,—,—


In [None]:
# firs let's drop the 'Unnamed: 0' column from all dataframes in our dfs_extra dataframes
column_to_drop = 'Unnamed: 0'
for k, v in dfs_extra.items(): # so k is file stem, and v is the actual dataframe
    print(f"Dropping {column_to_drop} from {k} dataframe shaped {dfs_extra[k].shape}")
    if column_to_drop in dfs_extra[k].columns: # before I "shoot" I check whether there is anything worth shooting
        dfs_extra[k] = dfs_extra[k].drop(columns=[column_to_drop])
        # shape after dropping
        print(f"Shape after dropping: {dfs_extra[k].shape}")

Dropping Unnamed: 0 from dobele_1925 dataframe shaped (48, 7)
Dropping Unnamed: 0 from mersrags_extra_1925 dataframe shaped (46, 7)
Dropping Unnamed: 0 from pavilosta_1925 dataframe shaped (50, 7)
Dropping Unnamed: 0 from aluksne_1925 dataframe shaped (36, 7)
Dropping Unnamed: 0 from mersrags_1925 dataframe shaped (48, 7)
Dropping Unnamed: 0 from riga_university_1925_p1 dataframe shaped (36, 7)
Dropping Unnamed: 0 from riga_university_1925_p2 dataframe shaped (36, 7)


In [None]:
# so let's sort the unique columns again and save them in a list
unique_columns = sorted(unique_columns)
# let's see them
print(unique_columns)

['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code', 't_max_c', 't_min_c']


In [None]:
# drop the first entry
# unique_columns.pop(0) # this pops from beginning (Note in very large lists this can be slow)
# show again
print(unique_columns)

['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code', 't_max_c', 't_min_c']


In [None]:
# so let's use unique_columns order for all dataframes
for k, v in dfs_extra.items():
    dfs_extra[k] = v[unique_columns] # in order for this to work the columns should exist
    # print columns
    print(k, dfs_extra[k].columns)


dobele_1925 Index(['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code',
       't_max_c', 't_min_c'],
      dtype='object')
mersrags_extra_1925 Index(['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code',
       't_max_c', 't_min_c'],
      dtype='object')
pavilosta_1925 Index(['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code',
       't_max_c', 't_min_c'],
      dtype='object')
aluksne_1925 Index(['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code',
       't_max_c', 't_min_c'],
      dtype='object')
mersrags_1925 Index(['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code',
       't_max_c', 't_min_c'],
      dtype='object')
riga_university_1925_p1 Index(['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code',
       't_max_c', 't_min_c'],
      dtype='object')
riga_university_1925_p2 Index(['date', 'notes', 'precip_24h_mm', 'precip_type', 'present_weather_code',
      

In [None]:
# now we can concatante all 7 dataframes in one big dataframe
# first we want to add column file_name using key for all dataframes so we do not lose reference later
for k, v in dfs_extra.items():
    dfs_extra[k]['file_name'] = k
# print aluksne_1925 head
dfs_extra['aluksne_1925'].head()

Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name
0,10-19-1925,—,3.0 mm,R,53.0,5.9,5.5\n,aluksne_1925
1,18.10.1925,Rīts auksts,2.3 mm,mixed,-999.0,9.1,4.2\n,aluksne_1925
2,11-10-1925,Daļēji mākoņains,0.0,,65.0,5.4,-0.5\n,aluksne_1925
3,10-07-1925 14:00,Daļēji mākoņains,0.5,S,50.0,15.6,10.4\n,aluksne_1925
4,12-28-1925,Neliels vējš,0.0,—,63.0,-1.1,—\n,aluksne_1925


In [None]:
# finally we can concatanate all the dataframes into one big dataframe
df = pd.concat(dfs_extra.values(), ignore_index=True) # values are the dataframes, we do not need keys(file_names) any more
# shape after concatenation
print(f"Shape after concatenation: {df.shape}")
# random sample
df.sample(5)

Shape after concatenation: (300, 8)


Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name
142,1925.04.05 08:00,Vēlāk sāka līt\n,nulle,none,65.0,5.3,-6.2,pavilosta_1925
264,18.10.1925 20:00,—\n,0.7,snow,53.0,10.5,6.0,riga_university_1925_p2
112,1925.03.23,Daļēji mākoņains\n,nulle,none,90.0,0.5,-7.5,pavilosta_1925
125,1925.05.01,Laikam migla\n,0.4 mm,S,90.0,9.3,-999,pavilosta_1925
146,11-10-1925,Daļēji mākoņains,0.0,,65.0,5.4,-0.5\n,aluksne_1925


In [None]:
# let's save the big dataframe without index
df.to_excel("day_2_xlsx/all_stations_without_liepaja_cleaned.xlsx", index=False)


In [None]:
# now let's download to local computer
from google.colab import files
files.download("day_2_xlsx/all_stations_without_liepaja_cleaned.xlsx")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Loading All Stations From Excel

URL - https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/all_stations_without_liepaja_cleaned.xlsx

In [2]:
all_stations_url = "https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/all_stations_without_liepaja_cleaned.xlsx"
print(f"Downloading {all_stations_url}")
df = pd.read_excel(all_stations_url)
print(f"Shape after loading all stations: {df.shape}")
df.head()

Downloading https://github.com/ValRCS/RTU_Data_Analysis_Visualization_CPD/raw/refs/heads/main/data/all_stations_without_liepaja_cleaned.xlsx
Shape after loading all stations: (300, 8)


Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name
0,22-08-1925,Laikam migla,12,rain\n,75,160,85.0,dobele_1925
1,1925/08/03,Skursteņi kūp,35,snow\n,60,185,192.0,dobele_1925
2,24-08-1925 14:00,Pūtis brāzmas,"0,0 mm",snow\n,50,176,,dobele_1925
3,1925/05/08,Novērojums vēlāk,00,none\n,81,177,59.0,dobele_1925
4,11-07-1925 07:00,Neliels vējš,03,mixed\n,51,199,161.0,dobele_1925


In [3]:
assert False # we want to stop here

AssertionError: 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   date                  300 non-null    object
 1   notes                 262 non-null    object
 2   precip_24h_mm         283 non-null    object
 3   precip_type           241 non-null    object
 4   present_weather_code  271 non-null    object
 5   t_max_c               277 non-null    object
 6   t_min_c               282 non-null    object
 7   file_name             300 non-null    object
dtypes: object(8)
memory usage: 18.9+ KB


## Goals For Type conversion

We want to convert as many columns as possible to correct data types
such as t max and t min into floats and datae into special datetime index

In [5]:
# let us strip all alpha characters from precip_24h_mm column
df['precip_24h_mm'] = df['precip_24h_mm'].str.strip('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') # we could add other characters
df.head()

Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name
0,22-08-1925,Laikam migla,12,rain\n,75,160,85.0,dobele_1925
1,1925/08/03,Skursteņi kūp,35,snow\n,60,185,192.0,dobele_1925
2,24-08-1925 14:00,Pūtis brāzmas,0,snow\n,50,176,,dobele_1925
3,1925/05/08,Novērojums vēlāk,0,none\n,81,177,59.0,dobele_1925
4,11-07-1925 07:00,Neliels vējš,3,mixed\n,51,199,161.0,dobele_1925


In [6]:
# we see that some entries in precip_24h_mm are using , instead of .
# let's convert that
df['precip_24h_mm'] = df['precip_24h_mm'].str.replace(',', '.')
df.head()

Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name
0,22-08-1925,Laikam migla,1.2,rain\n,75,160,85.0,dobele_1925
1,1925/08/03,Skursteņi kūp,3.5,snow\n,60,185,192.0,dobele_1925
2,24-08-1925 14:00,Pūtis brāzmas,0.0,snow\n,50,176,,dobele_1925
3,1925/05/08,Novērojums vēlāk,0.0,none\n,81,177,59.0,dobele_1925
4,11-07-1925 07:00,Neliels vējš,0.3,mixed\n,51,199,161.0,dobele_1925


In [7]:
# let's do same for t_max_c and t_min_c
# this means that any not a number mentions that are not standard will end up with blank strings ""
df['t_max_c'] = df['t_max_c'].str.strip('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
df['t_max_c'] = df['t_max_c'].str.replace(',', '.')

df['t_min_c'] = df['t_min_c'].str.strip('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
df['t_min_c'] = df['t_min_c'].str.replace(',', '.')
df.head()

Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name
0,22-08-1925,Laikam migla,1.2,rain\n,75,16.0,8.5,dobele_1925
1,1925/08/03,Skursteņi kūp,3.5,snow\n,60,18.5,19.2,dobele_1925
2,24-08-1925 14:00,Pūtis brāzmas,0.0,snow\n,50,17.6,,dobele_1925
3,1925/05/08,Novērojums vēlāk,0.0,none\n,81,17.7,5.9,dobele_1925
4,11-07-1925 07:00,Neliels vējš,0.3,mixed\n,51,19.9,16.1,dobele_1925


In [8]:
# let us check data types again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   date                  300 non-null    object
 1   notes                 262 non-null    object
 2   precip_24h_mm         283 non-null    object
 3   precip_type           241 non-null    object
 4   present_weather_code  271 non-null    object
 5   t_max_c               135 non-null    object
 6   t_min_c               239 non-null    object
 7   file_name             300 non-null    object
dtypes: object(8)
memory usage: 18.9+ KB


In [9]:
# let's convert precip_24h_mm, t_max_c,	t_min_c columns to numeric
df['precip_24h_mm'] = pd.to_numeric(df['precip_24h_mm'], errors='coerce')
df['t_max_c'] = pd.to_numeric(df['t_max_c'], errors='coerce')
df['t_min_c'] = pd.to_numeric(df['t_min_c'], errors='coerce')
# without coerce the conversion might fail, then we keep cleaning until we can convert

# if we are worried we might lose some potential candidation we could save into new or different columns


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  300 non-null    object 
 1   notes                 262 non-null    object 
 2   precip_24h_mm         245 non-null    float64
 3   precip_type           241 non-null    object 
 4   present_weather_code  271 non-null    object 
 5   t_max_c               126 non-null    float64
 6   t_min_c               230 non-null    float64
 7   file_name             300 non-null    object 
dtypes: float64(3), object(5)
memory usage: 18.9+ KB


In [10]:
# let us convert date column to date type
# in this case I keep original date column because our conversion might fail many times and generate Nan
df['date_converted'] = pd.to_datetime(df['date'], errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   date                  300 non-null    object        
 1   notes                 262 non-null    object        
 2   precip_24h_mm         245 non-null    float64       
 3   precip_type           241 non-null    object        
 4   present_weather_code  271 non-null    object        
 5   t_max_c               126 non-null    float64       
 6   t_min_c               230 non-null    float64       
 7   file_name             300 non-null    object        
 8   date_converted        77 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(5)
memory usage: 21.2+ KB


  df['date_converted'] = pd.to_datetime(df['date'], errors='coerce')


In [11]:
# let us count how many nan values date_converted has
df['date_converted'].isna().sum()

np.int64(223)

In [13]:
# let us see all date cells for rows where date_converted is nan
df[df['date_converted'].isna()][['date', 'date_converted']] # again note double brackets for showing more than one column

Unnamed: 0,date,date_converted
1,1925/08/03,NaT
2,24-08-1925 14:00,NaT
3,1925/05/08,NaT
4,11-07-1925 07:00,NaT
9,06-15-1925,NaT
...,...,...
295,02.10.1925 19:00,NaT
296,12.10.1925,NaT
297,17.11.1925 07:00,NaT
298,15.09.1925,NaT


In [14]:
# convert cells in date column which match DD.MM.YYYY format to date_converted column
print(f"{df['date_converted'].isna().sum()} nan values left in date_converted")
df.loc[df['date'].str.match(r'\d{2}\.\d{2}\.\d{4}'), 'date_converted'] = pd.to_datetime(df[df['date'].str.match(r'\d{2}\.\d{2}\.\d{4}')]['date'], errors='coerce')
# note we use raw strings r"this is raw string" which does not use escaping
# why? because regular expressions also use escaping
# how many nan values  left in date_converted
print(f"{df['date_converted'].isna().sum()} nan values left in date_converted")
# we saved a trip to regex101.com here by using AI assitance

# TODO at home convert the rest of datatime using regex for each different case

223 nan values left in date_converted
210 nan values left in date_converted


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   date                  300 non-null    object        
 1   notes                 262 non-null    object        
 2   precip_24h_mm         245 non-null    float64       
 3   precip_type           241 non-null    object        
 4   present_weather_code  271 non-null    object        
 5   t_max_c               126 non-null    float64       
 6   t_min_c               230 non-null    float64       
 7   file_name             300 non-null    object        
 8   date_converted        90 non-null     datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(5)
memory usage: 21.2+ KB


In [15]:
# let us see all date cells for rows where date_converted is nan
df[df['date_converted'].isna()][['date', 'date_converted']]

Unnamed: 0,date,date_converted
1,1925/08/03,NaT
2,24-08-1925 14:00,NaT
3,1925/05/08,NaT
4,11-07-1925 07:00,NaT
9,06-15-1925,NaT
...,...,...
291,14.09.1925,NaT
294,20.09.1925 20:00,NaT
295,02.10.1925 19:00,NaT
297,17.11.1925 07:00,NaT


In [16]:
# let us convert all dates in date column that look like 1925/08/03 to correct datetime in date_converted column
print(f"{df['date_converted'].isna().sum()} nan values left in date_converted")
df.loc[df['date'].str.match(r'\d{4}/\d{2}/\d{2}'), 'date_converted'] = pd.to_datetime(df[df['date'].str.match(r'\d{4}/\d{2}/\d{2}')]['date'], errors='coerce')
# note we use raw strings r"this is raw string" which does not use escaping
# why? because regular expressions also use escaping
# how many nan values  left in date_converted
print(f"{df['date_converted'].isna().sum()} nan values left in date_converted")


210 nan values left in date_converted
203 nan values left in date_converted


In [17]:
# now we can do describe
df.describe()

Unnamed: 0,precip_24h_mm,t_max_c,t_min_c,date_converted
count,245.0,126.0,230.0,97
mean,-23.619184,-19.555556,-29.070435,1925-06-08 22:01:14.226804224
min,-999.0,-999.0,-999.0,1925-01-11 00:00:00
25%,0.0,5.225,-1.9,1925-04-08 00:00:00
50%,0.4,12.95,6.5,1925-05-26 00:00:00
75%,1.3,18.5,12.275,1925-08-06 00:00:00
max,6.0,26.1,22.3,1925-12-10 00:00:00
std,154.863589,178.211225,184.697402,


In [18]:
# so -999 is an obvious outlier, it actually represents Nan so lets convert all -999 occurences to Nan
df.loc[df['t_min_c'] == -999, 't_min_c'] = pd.NA # pd.NA
df.describe()

Unnamed: 0,precip_24h_mm,t_max_c,t_min_c,date_converted
count,245.0,126.0,222.0,97
mean,-23.619184,-19.555556,5.881982,1925-06-08 22:01:14.226804224
min,-999.0,-999.0,-11.8,1925-01-11 00:00:00
25%,0.0,5.225,-1.3,1925-04-08 00:00:00
50%,0.4,12.95,6.85,1925-05-26 00:00:00
75%,1.3,18.5,13.0,1925-08-06 00:00:00
max,6.0,26.1,22.3,1925-12-10 00:00:00
std,154.863589,178.211225,8.124848,


In [19]:
# let's convert precip_24h_mm	t_max_c -999 also to NA
df.loc[df['precip_24h_mm'] == -999, 'precip_24h_mm'] = pd.NA
df.loc[df['t_max_c'] == -999, 't_max_c'] = pd.NA
df.describe()

Unnamed: 0,precip_24h_mm,t_max_c,t_min_c,date_converted
count,239.0,122.0,222.0,97
mean,0.867364,12.557377,5.881982,1925-06-08 22:01:14.226804224
min,0.0,-4.4,-11.8,1925-01-11 00:00:00
25%,0.0,6.05,-1.3,1925-04-08 00:00:00
50%,0.5,13.5,6.85,1925-05-26 00:00:00
75%,1.4,18.575,13.0,1925-08-06 00:00:00
max,6.0,26.1,22.3,1925-12-10 00:00:00
std,1.080277,7.523,8.124848,


In [20]:
# let us do the same replacement of -999 to NaN for present_weather_code column
df.loc[df['present_weather_code'] == -999, 'present_weather_code'] = pd.NA
df.describe()

# how to understand the above operation
# we look for rows where present_weather_code is -999
# in those rows we change value of well present_weather_code to Not a Number or NaN

Unnamed: 0,precip_24h_mm,t_max_c,t_min_c,date_converted
count,239.0,122.0,222.0,97
mean,0.867364,12.557377,5.881982,1925-06-08 22:01:14.226804224
min,0.0,-4.4,-11.8,1925-01-11 00:00:00
25%,0.0,6.05,-1.3,1925-04-08 00:00:00
50%,0.5,13.5,6.85,1925-05-26 00:00:00
75%,1.4,18.575,13.0,1925-08-06 00:00:00
max,6.0,26.1,22.3,1925-12-10 00:00:00
std,1.080277,7.523,8.124848,


In [21]:
# let us see head again
df.head()

Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name,date_converted
0,22-08-1925,Laikam migla,1.2,rain\n,75,16.0,8.5,dobele_1925,1925-08-22
1,1925/08/03,Skursteņi kūp,3.5,snow\n,60,18.5,19.2,dobele_1925,1925-08-03
2,24-08-1925 14:00,Pūtis brāzmas,0.0,snow\n,50,17.6,,dobele_1925,NaT
3,1925/05/08,Novērojums vēlāk,0.0,none\n,81,17.7,5.9,dobele_1925,1925-05-08
4,11-07-1925 07:00,Neliels vējš,0.3,mixed\n,51,19.9,16.1,dobele_1925,NaT


In [22]:
# let us create a duplicate of first row at the very end
df.loc[len(df)] = df.iloc[0]
df.tail()

Unnamed: 0,date,notes,precip_24h_mm,precip_type,present_weather_code,t_max_c,t_min_c,file_name,date_converted
296,12.10.1925,\n,2.0,R,,,1.0,riga_university_1925_p2,1925-12-10
297,17.11.1925 07:00,Novērojums vēlāk\n,0.0,,80.0,,1.0,riga_university_1925_p2,NaT
298,15.09.1925,Vēlāk sāka līt\n,0.7,mixed,81.0,,3.9,riga_university_1925_p2,NaT
299,08.11.1925,\n,,,71.0,,-1.9,riga_university_1925_p2,1925-08-11
300,22-08-1925,Laikam migla,1.2,rain\n,75.0,16.0,8.5,dobele_1925,1925-08-22


In [23]:
# so let's see shape before we drop duplicates
print(f"Shape before dropping duplicates: {df.shape}")
# let us drop duplicates
df = df.drop_duplicates()
print(f"Shape after dropping duplicates: {df.shape}")

Shape before dropping duplicates: (301, 9)
Shape after dropping duplicates: (300, 9)


In [24]:
# let us save to xlxs WITHOUT index
# first let's create day_2_xlsx folder IF it does not exist using Path
Path("day_2_xlsx").mkdir(exist_ok=True)
# now let's save

df.to_excel("day_2_xlsx/all_stations_without_Liepaja_cleaned_and_partially_converted.xlsx", index=False)

In [25]:
# let us download to our own computer
from google.colab import files
files.download("day_2_xlsx/all_stations_without_Liepaja_cleaned_and_partially_converted.xlsx")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Next steps

So we learned to manipulate and clean columns in various ways

* change column order for ALL dataframes to be merged
* combine dataframes with merge or concat commands


In [None]:
# Example skeleton for one dataframe named df
# df = dfs_extra['some_file']
# print(df.head()); print(df.columns); df.info()

# rename_map = {
#     'Date': 'date', 'DATE':'date',
#     'Station':'station', 'City':'station',
#     'Tmin':'t_min', 'TminC':'t_min', 'Min':'t_min',
#     'Tmax':'t_max', 'TmaxC':'t_max', 'Max':'t_max',
#     'Precip':'precip', 'Rain':'precip'
# }
# df = df.rename(columns=lambda c: rename_map.get(str(c), str(c).strip().lower()))


### 🧱 Skeleton: Type Coercion & Missing Handling

In [None]:
# required_cols = ['date','station','t_min','t_max','precip']
# for c in required_cols:
#     if c not in df.columns:
#         df[c] = pd.NA

# df['date'] = pd.to_datetime(df['date'], errors='coerce')
# for c in ['t_min','t_max','precip']:
#     df[c] = pd.to_numeric(df[c], errors='coerce')

# # Drop rows with no usable date
# df = df.dropna(subset=['date'])

# # Optional: fill precip missing with 0 if domain-appropriate
# # df['precip'] = df['precip'].fillna(0)

### 🧱 Skeleton: Text Normalization & Duplicates

In [None]:
# df['station'] = df['station'].astype(str).str.strip().str.title()
# before = len(df)
# df = df.drop_duplicates()
# print('Removed', before - len(df), 'duplicate rows')

### 🧪 Suggested Sanity Checks

In [None]:
# assert df['date'].notna().all(), 'Null dates remain'
# # Optional plausibility checks (adjust to real units)
# assert (df['t_min'] <= df['t_max']).dropna().all(), 'Found t_min > t_max'

## Part 4: Merging Cleaned DataFrames

### Goal
Combine all standardized DataFrames into **one big DataFrame** with a **unified column structure**.

### Strategy
1. **Define the target schema** used in Part 3.
2. **Align each DataFrame** to the schema (add missing columns, reorder).
3. **Concatenate** with `pd.concat`.
4. **Final cleanup**: deduplicate, reindex, and sort by date/station.
5. **Save outputs** (`CSV` or `Parquet`) for Day 3 (EDA).

### Integration Checklist
- All DataFrames have columns: `date, station, t_min, t_max, precip`
- Dtypes are consistent across DataFrames
- No catastrophic loss of rows during coercion
- Final row count equals the sum of inputs minus duplicates

### 🧱 Skeleton: Alignment & Concatenation

In [None]:
# Suppose you have a dict of cleaned dfs: dfs_clean
# target_cols = ['date','station','t_min','t_max','precip']

# def coerce_to_schema(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
#     for c in cols:
#         if c not in df.columns:
#             df[c] = pd.NA
#     # Reorder and drop extras for now
#     return df[cols]

# aligned = [coerce_to_schema(d.copy(), target_cols) for d in dfs_clean.values()]
# big = pd.concat(aligned, axis=0, ignore_index=True)
# big = big.drop_duplicates().reset_index(drop=True)
# big = big.sort_values(['date','station'])
# big.head()

### 🧾 Export for Day 3

In [None]:
# out_dir = Path('outputs'); out_dir.mkdir(exist_ok=True)
# big.to_csv(out_dir / 'latvia_meteo_1925_cleaned_merged.csv', index=False)
# # Optional: Parquet for speed/size
# # big.to_parquet(out_dir / 'latvia_meteo_1925_cleaned_merged.parquet', index=False)

## 🔄 Reflection
- What kinds of messiness were easier to fix with **Python basics**?
- What kinds of messiness required **pandas**?
- What are the risks of “over-cleaning” or discarding too much data?