# Lab 2 - Opening and Cleaning Data

<a href="https://colab.research.google.com/github/gaulinmp/AccountingDataAnalytics/blob/main/labs_hw/week2_connecting-to-data/Lab 2 - Opening and Cleaning Data.ipynb" target="_parent">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook will walk you through the process of opening and cleaning the Journal Entry dataset. We will cover the following steps:

1. Loading the data
2. Cleaning the data

The intent of the notebook is for you to execute each cell in order. Some will work and others will throw errors; when they do, I'll explain what's going on and help you through it.
If you're new to programming, it can sometimes feel daunting, so I want to provide these initial notebooks to help you get started, and hopefully show you that programming (especially in Colab) isn't all that bad once you try it out.

In [None]:
# This is a comment

# The lines below are called imports; they tell Python to load specific functionality
# This one uses a Python library called pathlib to access the Path object, which is just a way of pointing to files on the computer
from pathlib import Path
# This import is the real magic, pandas is the library that makes data manipulation easy (well... easier)
import pandas as pd

In [None]:
# Point to the file, and make sure it exists
file_path = Path("JEA Detail Raw.txt")
file_path.exists()

That should say True above. If it says False, then your data is not loaded correctly. Please make sure to drag the `JEA Detail Raw.txt` file into your Colab page (see Lab 1 for instructions).

Okay, now to actually load the data:

In [None]:
# Load the data into pandas
df = pd.read_csv(file_path)
# Display the first few rows of the DataFrame
df.head()

Oh no! An Error!

Welcome to debugging. This is often the hardest and least rewarding part of programming, because while there are often a few right ways to do something, there are infinite *wrong* ways to do something. So it's like searching for a needle in an infinite ocean of rusty needles.

*Future Mac here: I wrote this lab on my local machine, then tried it out on Google Colab. Colab shows you a button that says Next Steps: Explain Error. If you click this button, it immediately comes up with the solution. Being suspicious, I tried a new notebook with just the pandas read, and it took me 6 or 7 leading questions with this button and interacting with Gemini before I got the full solution working. So in this notebook, Gemini is reading down below, sees that I've solved the problem, and passes off the solution as its own. AI is all just plagiarism anyway, so maybe unsurprising? Anyway, if you want the "authentic" coding with AI in Colab experience, I suggest starting an empty notebook and writing the code from scratch with its help. That'll be more indicative of what you might expect in practice. And now I know that Google Colab is great for debugging much faster than with just my meat brain!*

In this case, we look at the error, specifically the `ParserError` on its last line, and search for that. So I copy and paste that last line into [Gemini](https://gemini.google.com) (or [Claude](https://claude.ai) or [ChatGPT](https://chat.openai.com)), which tells me:

<blockquote>
  This is a common **Pandas** error. It usually occurs when `read_csv` tries to parse a file but finds a row that violates the structure established by the header or the first row.

Specifically, "Expected 1 fields... saw 2" usually implies one of two things:

1. **Wrong Separator:** Pandas failed to find the default separator (comma) in the first row, so it assumed the file has only 1 column. Then, on line 14, it found a character that acted like a separator.
2. **Metadata/Junk at the top:** The file has a title or description in the first few lines that looks like a single column, confusing the parser before it hits the actual data.

Here are the most effective ways to fix this.

...
</blockquote>
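
If you want to see that second cause on its own, here's a minimal sketch that uses made-up text instead of our file (the `raw_text` contents and the variable names are my own invention for illustration, not the lab data). Two junk lines sit above a two-column table, so pandas decides there is one column and then trips on the first line that contains a comma. Skipping the junk lines is one way around it:

```python
import io

import pandas as pd

# Made-up text: two lines of junk, then a small comma-separated table.
raw_text = (
    "Report title\n"
    "Generated for the lab\n"
    "Account,Amount\n"
    "1000,25\n"
    "2000,40\n"
)

# The junk lines look like a one-column file, so the first line with a comma fails.
try:
    pd.read_csv(io.StringIO(raw_text))
    parser_error_message = None
except pd.errors.ParserError as err:
    parser_error_message = str(err)

print(parser_error_message)

# Skipping the two junk lines lets pandas find the real header.
toy_df = pd.read_csv(io.StringIO(raw_text), skiprows=2)
print(toy_df)
```

`io.StringIO` just lets pandas treat a string as if it were a file, which is handy for small experiments like this one.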

Gemini then suggests trying a different separator, starting with tab:

In [None]:
# Load the data into pandas
df = pd.read_csv(file_path, sep='\t')
# Display the first few rows of the DataFrame
df.head()

Same error, different numbers. What's happening now? Well, the error seems to say it expected 10 fields, but saw 37.

Okay, let's look at the data (we should have started here, but we were eager). Also note the `!head` command in the next cell: that leading exclamation mark is not Python. Then what *is* it? I suggest you ask the internet to find out about this lovely little bit of Notebook tech.

In [None]:
!head "{file_path}"

Uh... that's ugly. What's going on? Well, a few things. Playing with raw data is hard. Excel, Tableau, Alteryx, etc. have put a lot of code into trying to figure out things like this, but in Python, we have to do it ourselves. We can see that there's a bunch of text that isn't the actual table we want. In fact, we can't even see the data we expect. Let's look at more lines:

In [None]:
!head -n 20 "{file_path}"

Well, there's the data, finally. So we're closer. But there's a bunch of random text above the table we want, and those headers (the line with `Account "Account`) are ugly and split the column names across two lines. That means that we need to *a)* skip the first 8 lines (you can count to verify), and *b)* strip newlines from the column names.

In [None]:
# Load the data into pandas: tab-separated, skipping the 8 lines above the header,
# and reading numbers written like 1,234 as 1234 (that's what thousands=',' does)
df = pd.read_csv(file_path, sep='\t', thousands=',', skiprows=8)
# Rename the columns, using a trick to remove extra spaces and newlines. Essentially, we split each
# column name on whitespace, then re-join with a single space, and strip any leading/trailing spaces.
df = df.rename(columns={c:' '.join(c.split()).strip() for c in df.columns})
# Display the first few rows of the DataFrame
df.head(10)
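
To see what those two tricks do in isolation, here's a small sketch on made-up data (the column names and numbers below are invented for illustration and are not taken from the lab file):

```python
import io

import pandas as pd

# Made-up column names with the same kind of stray newlines and spaces.
toy = pd.DataFrame({"Account\nNumber": [1000, 2000], "  Post \n Date ": ["a", "b"]})
# split() with no arguments breaks on any run of whitespace, newlines included.
toy = toy.rename(columns={c: " ".join(c.split()) for c in toy.columns})
clean_names = list(toy.columns)
print(clean_names)  # ['Account Number', 'Post Date']

# thousands=',' lets pandas read "1,234" as the number 1234 instead of text.
toy_amounts = pd.read_csv(
    io.StringIO("Account\tAmount\n1000\t1,234\n2000\t56\n"),
    sep="\t",
    thousands=",",
)
print(toy_amounts["Amount"].tolist())  # [1234, 56]
```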

Now let's drop the three columns at the end that are entirely empty. `how='all'` means a column is dropped only if *every* value in it is missing. `axis=1` means check columns; `axis=0` would check rows instead.

In [None]:
df = df.dropna(how='all', axis=1)
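
If the `axis` business feels abstract, this sketch on a tiny made-up table (the names and values are invented, not from the lab file) shows both directions side by side:

```python
import numpy as np
import pandas as pd

# Made-up table: one column is entirely empty, and so is one row.
toy = pd.DataFrame({
    "User ID": ["amy", None, "bob"],
    "Amount": [10.0, np.nan, 30.0],
    "Unnamed: 2": [np.nan, np.nan, np.nan],
})

# axis=1: drop columns where every value is missing.
no_empty_cols = toy.dropna(how="all", axis=1)
# axis=0: drop rows where every value is missing.
no_empty_rows = toy.dropna(how="all", axis=0)

print(list(no_empty_cols.columns))  # ['User ID', 'Amount']
print(len(no_empty_rows))  # 2
```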

Now let's remove newlines (`\n`) from the user names. The `strip` method removes whitespace, newlines included, from the start and end of each value. While you're at it, [look up](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html) what `.str` is doing.

In [None]:
df["User ID"] = df["User ID"].str.strip()
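
Here's the same idea on a small made-up Series (the names are invented for illustration), so you can check the result by eye:

```python
import pandas as pd

# Made-up user names with stray newlines and spaces, plus one missing value.
toy_users = pd.Series(["amy\n", "  bob", "carol\n\n", None])

# .str applies the string method to every element; missing values stay missing.
stripped = toy_users.str.strip()
print(stripped.tolist())
```

Note that `strip` only touches the two ends of each string. Whitespace in the middle of a value would be left alone.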

Let's look at the cleaned DataFrame again. I'll cheat and show you a few rows further down, so you can see the next issue; that's what [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) is doing:

In [None]:
df.iloc[42:48]

So we have some bad lines to remove. I suggest dropping them by requiring one of the columns to be non-missing.

In [None]:
df = df.dropna(subset=['User ID'])
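
A quick sketch of what `subset` does, again on made-up data (the `Memo` column and its values are hypothetical, just standing in for a junk line):

```python
import pandas as pd

# Made-up table: the middle row is a junk line with no User ID.
toy = pd.DataFrame({
    "User ID": ["amy", None, "bob"],
    "Memo": ["rent", "Page 2 of 9", "payroll"],
})

# Only the listed column is checked; other columns may still hold values.
kept = toy.dropna(subset=["User ID"])
print(kept["User ID"].tolist())  # ['amy', 'bob']
print(kept.index.tolist())  # [0, 2]
```

The surviving rows keep their original index labels. If the gaps bother you, `reset_index(drop=True)` renumbers them.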

Lastly, we have "data types" to fix. If you look at the `dtypes` attribute, you'll see that the `Post Date` is an `object`, which means it's a string, not a date.

In [None]:
df.dtypes

But we know that this is actually a date, so you should convert it to the appropriate type:

In [None]:
df['Post Date'] = # write your own code (or ask AI) to make the Post Date column a date
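
If you get stuck, here's one possible approach sketched on made-up data. The month/day/year format below is an assumption for illustration only; check what the dates in your own table actually look like before borrowing it.

```python
import pandas as pd

# Made-up dates as text; the format here is assumed, not taken from the lab file.
toy = pd.DataFrame({"Post Date": ["01/15/2024", "02/01/2024", "not a date"]})

# errors="coerce" turns anything that doesn't match the format into a missing date (NaT).
toy["Post Date"] = pd.to_datetime(toy["Post Date"], format="%m/%d/%Y", errors="coerce")

print(toy.dtypes)
print(toy["Post Date"].isna().sum())  # 1
```

Counting the missing dates afterwards is a cheap way to spot values that didn't convert.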

Now, the last step is to produce the Lab's output, which is a screenshot of the cleaned table and a list of the unique `User ID` values. There are a few ways to do this, and if you put your cursor after `df['User ID']` and wait, Gemini might even suggest a solution for you. It only does that because I wrote the comment above that line; if the comment weren't there, Gemini wouldn't know what you want. Remember this for future reference: it's *incredibly useful* to know that you can often just write a comment and get AI to give you the code to do it.

In [None]:
df.head()

In [None]:
# Write your own code to list the unique values:
df['User ID']

Now take a screenshot, and you're done! I'd save this notebook, because Homework 2 builds on it.