# Lab 2 - Opening and Cleaning Data

<a href="https://colab.research.google.com/github/gaulinmp/AccountingDataAnalytics/blob/main/labs_hw/week2_connecting-to-data/Lab 2 - Opening and Cleaning Data.ipynb" target="_parent">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook will walk you through the process of opening and cleaning the Journal Entry dataset. We will cover the following steps:

1. Loading the data
2. Cleaning the data

The intent of the notebook is for you to execute each cell in order. Some will work, others will throw errors, and then I'll explain what's going on, and help you through it.
If you're new to programming, it can sometimes feel daunting, so I want to provide these initial notebooks to help you get started, and hopefully show you that programming (especially in Colab) isn't all that bad once you try it out.

In [None]:
# This is a comment

# The lines below are called imports; they tell Python to load specific functionality
# This one uses a Python library called pathlib to access the Path object, which is just a way of pointing to files on the computer
from pathlib import Path
# This import is the real magic, pandas is the library that makes data manipulation easy (well... easier)
import pandas as pd

In [None]:
# Point to the file, and make sure it exists
file_path = Path("JEA Detail.txt")
file_path.exists()

That should say True above. If it says False, then your data is not loaded correctly. Please make sure to drag the `JEA Detail.txt` file into your Colab page (see Lab 1 for instructions).
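If the check comes back False, one quick way to spot a filename mismatch is to list the text files Python can see in the working directory. This is only a minimal sketch; the variable name `txt_files` is made up for illustration.

```python
from pathlib import Path

# List every .txt file in the current working directory, sorted by name,
# so the real filename can be compared against the one in file_path.
txt_files = sorted(p.name for p in Path.cwd().glob("*.txt"))
print(txt_files)
```

If the file shows up under a slightly different name, either rename it or change the string passed to `Path(...)`.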

Okay, now to actually load the data:

In [None]:
# Load the data into pandas
df = pd.read_csv(file_path)
# Display the first few rows of the DataFrame
df.head()

Oh no! An Error!

Welcome to debugging. This is often the hardest and least rewarding part of programming, because while there are often a few right ways to do something, there are infinite *wrong* ways to do it. So it's like searching for a needle in an infinite ocean of rusty needles.

*Future Mac here: I wrote this lab on my local machine, then tried it out on Google Colab. Colab shows you a button that says Next Steps: Explain Error. If you click this button, it immediately comes up with the solution. Being suspicious, I tried a new notebook with just the pandas read, and it took me 6 or 7 leading questions with this button and interacting with Gemini before I got the full solution working. So in this notebook, Gemini is reading down below, sees that I've solved the problem, and passes off the solution as its own. AI is all just plagiarism anyway, so maybe unsurprising? Anyway, if you want the "authentic" coding with AI in Colab experience, I suggest starting an empty notebook and writing the code from scratch with its help. That'll be more indicative of what you might expect in practice. And now I know that Google Colab is great for debugging much faster than with just my meat brain!*

In this case, we look at the error, specifically the `UnicodeDecodeError` on the last line, and search for it. I copied and pasted that last line into [Gemini](https://gemini.google.com) (or [Claude](https://claude.ai) or [ChatGPT](https://chat.openai.com)), which told me:

<blockquote>
  You're encountering a **UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte** because you're trying to decode a byte sequence as UTF-8, but the sequence doesn't conform to the UTF-8 encoding standard. Specifically, the byte `0xff` is not a valid starting byte for any UTF-8 character. This often happens when:

  * **The file isn't actually UTF-8 encoded.** It might be encoded in a different character set, like Latin-1 (ISO-8859-1), UTF-16, or something else entirely.
  * **The file is corrupted or contains binary data.** If the file isn't purely text, or if it's been damaged, you might find non-textual bytes at the beginning.

Here's how to approach and fix this error:

...
</blockquote>

It then suggests some code to try the other main encoding, `latin-1`:

In [None]:
try:
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
except UnicodeDecodeError:
    print("UTF-8 decoding failed. Trying latin-1...")
    try:
        with open(file_path, 'r', encoding='latin-1') as f:
            content = f.read()
        print("Successfully decoded with latin-1.")
    except Exception as e:
        print(f"Latin-1 decoding also failed: {e}")
        # You might try other encodings here, like 'cp1252' or 'utf-16'

Okay, so that worked! Thanks AI.

*Narrator: It didn't. Google's overconfidence would soon disappoint the eager student.*
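Why would the latin-1 attempt "succeed" and still not help? A small sketch, using four made-up bytes that begin with `0xff` (as the error message reported) rather than the real file: UTF-8 rejects them, while latin-1 accepts anything because every possible byte maps to some character.

```python
# Hypothetical bytes that start with 0xff, like the byte named in the error.
sample = bytes([0xFF, 0xFE, 0x41, 0x00])

# UTF-8 refuses the very first byte, just as pandas did.
try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(f"UTF-8 failed: {err}")

# Latin-1 assigns a character to every byte value, so decoding cannot fail,
# even when the resulting text is nonsense.
as_latin1 = sample.decode("latin-1")
print(repr(as_latin1))
```

So "no error" from latin-1 only tells us the bytes were read, not that they were read correctly.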

Okay, so back to pandas:

In [None]:
# Load the data into pandas
df = pd.read_csv(file_path, encoding="latin-1")
# Display the first few rows of the DataFrame
df.head()

Nooooooooooooooooooooooooo. What's happening now? Well the error seems to say it expected 1 column, but saw 2. Wait, shouldn't there be more than 1 column? 

Okay, let's look at the data (we should have started here, but we were eager). Also note the `!head` command in the next cell: that leading exclamation mark means it's not Python. Then what *is* it? I suggest you ask the internet to find out about this lovely little bit of Notebook tech.

In [None]:
!head "{file_path}"

Uh... that's ugly. What's going on? Well, a few things. Playing with raw data is hard; Excel, Tableau, Alteryx, etc. have put a lot of code into trying to figure out things like this. But in Python, we have to do it ourselves. Let's start with the fact that there are two � things at the start. And here's where years of fighting with Python and Unicode helps (call it an unfair professor's advantage). Let's look at what those are:

In [None]:
raw_data = file_path.read_bytes()
# Print the first 16 bytes to inspect. Why 16? Because I started with 100 then realized I could make my point by just showing the first 16.
print(raw_data[:16])

So we see two bytes, `\xff\xfe` *(`\x` signifies hexadecimal, which is base 16: the ten decimal digits plus six letters, so 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f)*, and then we see `C`, then `\x00`, then `o`, then `\x00`. So it looks like `Company` with `\x00` between each character. If we had fought with Unicode long enough, we might immediately think "Oh, that's a 2-byte Unicode encoding, let's try UTF-16". But we might not have that experience, so instead, let's just ask Gemini again:

**Me**: I have a file that starts with b'\xff\xfeC\x00o\x00m\x00p\x00a\x00n\x00y\x00' and I think it should start with "Company". What gives?

**Gemini**: That `b'\xff\xfeC\x00o\x00m\x00p\x00a\x00n\x00y\x00'` prefix means your file is encoded in UTF-16 Little Endian (UTF-16-LE). The `\xff\xfe` at the beginning is a Byte Order Mark (BOM), which tells a program how the bytes are arranged. Try `byte_string.decode('utf-16-le')`
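To make the hexadecimal notation and the byte order mark concrete, here is a small sketch that rebuilds the bytes from the question by hand instead of reading the file. It also shows one detail worth knowing: the `utf-16-le` codec keeps the BOM as an invisible `\ufeff` character, while plain `utf-16` uses the BOM to pick the byte order and then drops it.

```python
import codecs

# Hexadecimal is base 16: the digits run 0-9 and then a-f.
print(int("ff", 16))   # 255
print(hex(254))        # 0xfe

# Rebuild the sample from the question: the UTF-16 little-endian BOM,
# followed by "Company" encoded two bytes per character.
sample = codecs.BOM_UTF16_LE + "Company".encode("utf-16-le")
print(sample)

# "utf-16-le" assumes the byte order and keeps the BOM as the character \ufeff.
with_bom = sample.decode("utf-16-le")
# "utf-16" reads the BOM to choose the byte order, then discards it.
without_bom = sample.decode("utf-16")
print(repr(with_bom))
print(repr(without_bom))
```

If a stray `\ufeff` ever shows up glued to the front of your first column name, this is the usual cause.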

In [None]:
print(raw_data.decode('utf-16-le')[:1000])

That looks like data! Okay, now to read it in Pandas. But wait! It looks like it has spaces (actually tabs) between the columns, not commas. And the pandas command we're using is `read_csv` which means comma separated values. So we have to tell it that the file uses tabs:

In [None]:
# Load the data into pandas
df = pd.read_csv(file_path, encoding="utf-16-le", sep='\t')
# Display the first few rows of the DataFrame
df.head(10)

Well, the data look ugly, but at least we are reading it in.
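The effect of `sep='\t'` is easy to see on a toy example. The sketch below uses made-up tab-separated text, not the lab file. It also hints at one plausible source of the earlier "expected 1 column, saw 2" error: with the default comma separator, the header looks like a single column, and a later line that happens to contain a comma splits into two pieces.

```python
import io
import pandas as pd

# Hypothetical tab-separated text; the amount on the last line contains a comma.
text = "Company\tAmount\nAcme\t10\nBeta\t1,234\n"

# Default separator (comma): the header is ONE field, the last line is TWO.
try:
    pd.read_csv(io.StringIO(text))
    comma_error = None
except pd.errors.ParserError as err:
    comma_error = str(err)
    print(comma_error)

# Telling pandas the separator is a tab gives the two columns we expect.
toy = pd.read_csv(io.StringIO(text), sep="\t")
print(toy.columns.tolist())
```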

Now we should just skip those first 8 rows of "noise":

In [None]:
# Load the data into pandas
df = pd.read_csv(file_path, encoding="utf-16-le", sep='\t', skiprows=8)
# Display the first few rows of the DataFrame
df.head(10)

Those headers are ugly because the input file wasn't machine-readable: it splits the column names across two lines. Unfortunately, there's no clean way to deal with that (we could open the file, look for the ===== separator, and then re-assemble the headers manually). For now, let's assume that these columns won't change, so we can just hard-code the column names that need extra words: skip 9 rows instead of 8, then add back "Account", "Post", "User", and "Transaction". The `thousands=','` argument below also tells pandas that a comma inside a number is a thousands separator.

In [None]:
# Load the data into pandas
df = pd.read_csv(file_path, encoding="utf-16-le", sep='\t', thousands=',', skiprows=9)
# Rename the columns
df = df.rename(columns={
    'Category': 'Account Category',
    'Date': 'Post Date',
    'ID': 'User ID',
    'Number': 'Transaction Number'
})
# Display the first few rows of the DataFrame
df.head(10)
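Here is the same trio of ideas (`skiprows`, `thousands`, and `rename`) on a miniature, made-up export, so you can see what each argument contributes. The report layout, column names, and values below are hypothetical and much simpler than the real file.

```python
import io
import pandas as pd

# Hypothetical miniature of a report export: one title line, a header that is
# split across two lines, then the data.
text = (
    "Hypothetical report title\n"
    "Account\tPost\n"
    "Category\tDate\tAmount\n"
    "Cash\t01/15/2024\t1,234.50\n"
    "Rent\t01/31/2024\t-800.00\n"
)

toy = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    skiprows=2,      # skip the title and the first header line
    thousands=",",   # read "1,234.50" as the number 1234.5
)
# Put back the words that lived on the skipped header line.
toy = toy.rename(columns={"Category": "Account Category", "Date": "Post Date"})
print(toy)
print(toy.dtypes)
```

Hard-coding the renames is fragile if the export layout ever changes, which is why the text above treats it as a "for now" choice.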

Now let's drop the three columns at the end that are entirely empty. That's what `how='all'` means: drop only if *every* value is missing. `axis=1` means check columns; `axis=0` would check rows instead.

In [None]:
df = df.dropna(how='all', axis=1)
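A toy table makes the `axis` distinction easier to see. This is a sketch with invented values; the column name `Unnamed: 2` is just a stand-in for an empty trailing column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "User ID": ["A1", None, "B2"],
    "Amount": [10.0, np.nan, 5.0],
    "Unnamed: 2": [np.nan, np.nan, np.nan],  # entirely empty column
})

# axis=1 with how="all": drop COLUMNS in which every value is missing.
no_empty_cols = toy.dropna(how="all", axis=1)
# axis=0 with how="all": drop ROWS in which every value is missing.
no_empty_rows = toy.dropna(how="all", axis=0)

print(no_empty_cols.columns.tolist())
print(len(no_empty_rows))
```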

Now let's remove the newline (`\n`) at the beginning of `User ID` values such as `BeanCounter25`. While you're at it, [look up](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html) what `.str` is doing.

In [None]:
df["User ID"] = df["User ID"].str.strip()
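As a quick illustration of what `.str.strip()` does, here is a sketch on a hand-made Series. Only `BeanCounter25` comes from the lab; `Auditor7` is an invented value.

```python
import pandas as pd

# Hypothetical IDs with stray whitespace, including a leading newline.
ids = pd.Series(["\nBeanCounter25", "  Auditor7 ", None])

# .str applies a string method to every element; missing values stay missing.
clean = ids.str.strip()
print(clean.tolist())
```

Note that `strip()` removes whitespace from both ends, so it handles leading newlines and trailing spaces in one go.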

Let's look at the cleaned DataFrame again. I'll cheat and use [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) to jump a few rows down, so you can see the next issue:

In [None]:
df.iloc[42:48]

So we have some bad lines to remove. I suggest dropping them by requiring one of the columns to be non-missing.

In [None]:
df = df.dropna(subset=['User ID'])
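To see both `iloc` and `dropna(subset=...)` in isolation, here is a sketch on a toy table. The junk rows are invented; the real file's bad lines may look different.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "User ID": ["A1", np.nan, "B2", np.nan],
    "Amount": ["10", "Page 2", "5", "-----"],  # hypothetical junk rows
})

# iloc selects by position: rows 1 and 2 here (the end of the slice is excluded).
print(toy.iloc[1:3])

# Keep only rows where "User ID" is present; other columns are not checked.
kept = toy.dropna(subset=["User ID"])
print(kept)
```

Choosing which column to require is a judgment call: it should be one that every genuine row fills in.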

Lastly, we have "data types" to fix. If you look at the `dtypes` attribute, you'll see that the `Amount` and `Post Date` columns are `object`, which means they're strings.

In [None]:
df.dtypes

But we know that these are actually a numeric value and a date, respectively, so you should convert them to the appropriate types:

In [None]:
df['Amount'] = # your code to make the Amount column a number
df['Post Date'] = # your code to make the Post Date column a date
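If you're stuck, here is a sketch of two conversion functions pandas offers, `pd.to_numeric` and `pd.to_datetime`, run on a toy table with invented values. The date format and the comma handling are assumptions; check what the real columns look like before copying anything.

```python
import pandas as pd

toy = pd.DataFrame({
    "Amount": ["1,234.50", "-800.00", "oops"],
    "Post Date": ["01/15/2024", "01/31/2024", "not a date"],
})

# Remove thousands separators, then convert; errors="coerce" turns anything
# unparseable into NaN instead of raising an error.
toy["Amount"] = pd.to_numeric(
    toy["Amount"].str.replace(",", "", regex=False), errors="coerce"
)
# Same idea for dates (NaT marks the failures). The format is an assumption.
toy["Post Date"] = pd.to_datetime(
    toy["Post Date"], format="%m/%d/%Y", errors="coerce"
)
print(toy.dtypes)
```

Be careful with `errors="coerce"`: it silently hides bad values, so count the missing values afterwards to see how many failed.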

Now, the last step is to produce the Lab's output: a screenshot of the cleaned table and a list of the unique `User ID` values. There are a few ways to do this, and if you put your cursor after `df['User ID']` and wait, Gemini might even suggest a solution for you. It can only do that because I wrote the comment above it; without the comment, it wouldn't know what you want. Remember this for future reference: it's *incredibly useful* to know that you can often just write a comment and get AI to give you the code to do it.

In [None]:
df.head()

In [None]:
# List the unique values:
df['User ID']
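For reference, here are a few of the ways pandas can summarize distinct values, sketched on a small invented Series rather than the lab data.

```python
import pandas as pd

# Hypothetical IDs; only the Series name matches the lab's column.
ids = pd.Series(["BeanCounter25", "Auditor7", "BeanCounter25"], name="User ID")

print(ids.unique())        # distinct values, in order of first appearance
print(ids.nunique())       # how many distinct values there are
print(ids.value_counts())  # each value with the number of rows it appears in
```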

Now take a screenshot, and you're done! I'd save this notebook, because Homework 2 builds on it.