In [1]:
import numpy as np 
import pandas as pd
from pathlib import Path

We set some default paths for the project and check that the input file exists.

In [2]:
datadir = Path("data/raw/")
outputdir = Path("data/processed/")
filename = datadir / "les1.csv"
filename.resolve(), filename.exists()

(PosixPath('/home/admindme/code/DME22/notebooks/les1/data/raw/les1.csv'), True)

In [3]:
df = pd.read_csv(filename)
df.head()

Unnamed: 0,x1,x2,name
0,4,0.683287,Python Regius
1,5,0.787097,Python Regius
2,7,,Python Regius
3,9,0.802364,Python Regius
4,0,,Python Regius


Let's check some of the statistics

In [4]:
df.describe()

Unnamed: 0,x1,x2
count,100.0,82.0
mean,4.62,0.587542
std,2.834777,0.229508
min,0.0,0.203242
25%,2.0,0.377285
50%,4.5,0.632648
75%,7.0,0.786594
max,9.0,0.996141


The count shows that x2 has some NaNs: only 82 of the 100 values are present.
Let's select all columns that contain NaNs.

In [5]:
select = list(df.isna().sum() > 0)
select

[False, True, False]

Check if it works

In [6]:
df.columns[select]

Index(['x2'], dtype='object')

Now drop the rows that have NaNs in those columns.

In [7]:
df = df.dropna(subset=df.columns[select], axis="rows")
df

Unnamed: 0,x1,x2,name
0,4,0.683287,Python Regius
1,5,0.787097,Python Regius
3,9,0.802364,Python Regius
6,8,0.855227,Python Regius
7,9,0.861283,Python Regius
...,...,...,...
94,9,0.506065,Python Regius
95,7,0.785085,Python Regius
96,0,0.295006,Python Regius
97,1,0.768772,Python Regius


We dropped 18 rows.
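
It can help to compute that number rather than read it off the output. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not the notebook's data); it counts the rows removed by `dropna` and the fraction of the data that represents.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the notebook's df
toy = pd.DataFrame({"x1": [4, 5, 7, 9], "x2": [0.68, 0.79, np.nan, 0.80]})

n_before = len(toy)
cleaned = toy.dropna(subset=["x2"])  # drop rows where x2 is missing
n_dropped = n_before - len(cleaned)
fraction = n_dropped / n_before

print(f"dropped {n_dropped} of {n_before} rows ({fraction:.0%})")
```

If the fraction turns out to be large, dropping rows may not be the best strategy.
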
Let's check the types:

In [8]:
df.dtypes

x1        int64
x2      float64
name     object
dtype: object

Let's clean up the name column.
We use a regular expression to select the first word, up to the first space.
Use https://regex101.com to create your own regular expressions.
Or use [you.com to try with natural language](https://you.com/search?q=regular+expression+that+selects+the+first+word+up+to+a+space&fromSearchBar=true)

In [9]:
import re

# Raw string, so that \w is not treated as a string escape sequence
regex = re.compile(r"^\w+")
out = regex.search("Python Regius")
out.group()

'Python'

Let's put that into a function

In [10]:
def extract(regex: re.Pattern, msg: str) -> str:
    # Assumes the pattern matches; re.search returns None otherwise
    out = re.search(regex, msg)
    return out.group()

And apply it

In [11]:
df["name"] = df["name"].apply(lambda x: extract(regex=regex, msg=x))

In [12]:
df.head()

Unnamed: 0,x1,x2,name
0,4,0.683287,Python
1,5,0.787097,Python
3,9,0.802364,Python
6,8,0.855227,Python
7,9,0.861283,Python


Looks good.
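
As an alternative to `apply` with a custom function, pandas has a vectorized `Series.str.extract`. The snippet below is a sketch on a hypothetical toy Series (`names` is made up, including the missing value); it uses the same pattern with a capture group, and a missing name simply stays missing instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data; the None entry shows how missing names are handled
names = pd.Series(["Python Regius", "Python Regius", None])

# The capture group (\w+) selects the first word; expand=False returns a Series
first_word = names.str.extract(r"^(\w+)", expand=False)
print(first_word)
```
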

Now we save the file with a timestamp.

In [13]:
# Create the output folder, including missing parents, if it does not exist yet
outputdir.mkdir(parents=True, exist_ok=True)

In [14]:
from datetime import datetime
tag = datetime.now().strftime("%Y%m%d-%H%M") + ".csv"
output = outputdir / tag
df.to_csv(output, index=False)

A lot of data scientists will stop here.

However, while the job is done, leaving the code in this state is risky: a notebook is hard to rerun, test and reuse.
Notebooks are for prototyping, not for creating a solid solution.

Now, have a look at the src folder.
Start at main.py, and also look at the other files.

Now go to the terminal and `cd` to the `les1` directory.
From there, run:

`poetry shell`

`python src/main.py --file=les1.csv`

Note that a `logging.log` file appears; open it and check its contents.

# Exercise

Inside the data/raw folder in the root directory DME22 you will find a `palmerpenguins.parq` file.
This is a parquet file. If you always use csv, [read this](https://bawaji94.medium.com/feather-vs-parquet-vs-csv-vs-jay-55206a9a09b0) so you know why that's not a good idea.
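
One reason, among others, is that csv is plain text and does not store column types. The snippet below is a small sketch with a hypothetical toy frame (the column names and values are made up); it writes the frame to csv in memory and reads it back, after which the datetime and categorical types are gone. Parquet stores the schema alongside the data, so this problem does not arise there.

```python
import io

import pandas as pd

# Hypothetical toy frame with a datetime and a categorical column
frame = pd.DataFrame(
    {
        "when": pd.to_datetime(["2022-01-01", "2022-01-02"]),
        "species": pd.Categorical(["Adelie", "Gentoo"]),
    }
)

# Round trip through csv, kept in memory to avoid touching the disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print(frame.dtypes)      # typed columns
print(roundtrip.dtypes)  # both columns come back as generic text columns
```
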

## setup
- mkdir a new folder outside of the DME22 folder, named `cleanup`
- initialize a new poetry environment
- `poetry add` the libraries you need.
- mkdir a `data/raw` folder and copy `palmerpenguins.parq` into it. Bonus points if you use the `cp` command :)
- mkdir a `data/processed`, `src` and `notebook` folder.
- create a `src/main.py` file

## prototype
Make a notebook where you:
- load the file with pandas
- check for nans
- figure out how to clean them out.
- double-check how many rows you throw away. Find a solution if this doesn't look good.
- clean up the column with the names of the penguins. They are too long, so shorten them with a regex.
- save the cleaned file with a timestamp

## implement
After you have prototyped this, create an `__init__.py` file inside the `src` folder.
Turn the cleanup into a process you can run from the command line.
Use [click](https://click.palletsprojects.com/en/8.1.x/) to create easy arguments.
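
A click command could look roughly like the sketch below. The option names `--file` and `--datadir`, their defaults, and the body of `main` are assumptions for illustration; the actual cleanup logic is left out on purpose.

```python
from pathlib import Path

import click


@click.command()
@click.option("--file", "filename", default="palmerpenguins.parq",
              show_default=True, help="Name of the file inside the raw data folder.")
@click.option("--datadir", default="data/raw", show_default=True,
              help="Folder that holds the raw data.")
def main(filename: str, datadir: str) -> None:
    """Placeholder entry point: only reports which file would be cleaned."""
    path = Path(datadir) / filename
    click.echo(f"Would clean {path}")
```

In a real `src/main.py` you would end the file with the usual `if __name__ == "__main__": main()` guard, so that `python src/main.py --file=...` calls the command.
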

Try to add type hints.
Format your code with [black](https://github.com/psf/black) by running `black src` from the command line, where `src` is the folder you want to format (make sure you have `cd`-ed to the folder where you see `src` when you run `ls`).

Add logging with loguru.

Don't hardcode any settings. Use pydantic with a settings.py file.
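
A `settings.py` could start from something like the sketch below. It uses pydantic's `BaseModel`, which is available in both pydantic 1.x and 2.x; the field names and default values are assumptions for illustration. Pydantic 1.x also has a `BaseSettings` class that can read values from environment variables; in pydantic 2.x that class lives in the separate `pydantic-settings` package.

```python
from pathlib import Path

from pydantic import BaseModel


class Settings(BaseModel):
    """Hypothetical project settings, validated and typed by pydantic."""

    datadir: Path = Path("data/raw")
    outputdir: Path = Path("data/processed")
    filename: str = "palmerpenguins.parq"


settings = Settings()
input_path = settings.datadir / settings.filename
print(input_path)
```

Other modules can then import `settings` instead of repeating paths as string literals.
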



