In [3]:
import pandas as pd
import numpy as np

### Quick function to make our data

In [4]:
def create_sample_data():
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'D@vid', np.nan],
        'Age': [25, 30, 35, 40, 28],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
        'ExtraColumn': ['X', 'Y', 'Z', 'W', 'V']
    }
    return pd.DataFrame(data)

### The functions below drop a column and then clean the `Name` column

In [5]:
def drop_extra_column(df):
    return df.drop(columns=['ExtraColumn'])

In [6]:
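def clean_names(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    # Fill missing names, then strip anything that is not a letter or a space
    df['Name'] = df['Name'].fillna('Unknown')
    df['Name'] = df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
    return df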

### Calling the functions to make the data, then cleaning it

In [7]:
df = create_sample_data()
df

Unnamed: 0,Name,Age,City,ExtraColumn
0,Alice,25,New York,X
1,Bob,30,Los Angeles,Y
2,Charlie,35,Chicago,Z
3,D@vid,40,Houston,W
4,,28,Miami,V


In [8]:
df = drop_extra_column(df)
df = clean_names(df)
df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago
3,Dvid,40,Houston
4,Unknown,28,Miami
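
Notice that `D@vid` came out as `Dvid`: the pattern `[^a-zA-Z ]` deletes every character that is not an ASCII letter or a space, and it does not try to repair the name. The sketch below applies the same pattern with the standard-library `re` module to a few made-up names. The sample strings and the helper name `strip_non_letters` are illustrative only and are not part of the notebook.

```python
import re

# Same pattern used in clean_names: anything that is NOT a-z, A-Z, or a space.
NON_LETTER = re.compile(r'[^a-zA-Z ]')


def strip_non_letters(name: str) -> str:
    """Hypothetical helper: remove every character the pattern matches."""
    return NON_LETTER.sub('', name)


# Made-up inputs: a symbol, a hyphen, an accented letter, and a digit.
samples = ['D@vid', 'Anne-Marie', 'Jos\u00e9', 'Bob 2']
cleaned = [strip_non_letters(name) for name in samples]

for before, after in zip(samples, cleaned):
    print(f'{before!r} -> {after!r}')
```

Hyphens, digits, and accented letters are all removed as well, so this pattern may be too aggressive for real-world names.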


### Below is how it would be set up in a separate Python file (a `.py` file)

We add one function that was not in the code above:

```python
def clean_data(df): ...
```
It just calls the two cleaning functions in order.

In [9]:
import pandas as pd
import numpy as np


def create_sample_data():
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', 'D@vid', np.nan],
        'Age': [25, 30, 35, 40, 28],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
        'ExtraColumn': ['X', 'Y', 'Z', 'W', 'V']
    }
    return pd.DataFrame(data)


def drop_extra_column(df):
    return df.drop(columns=['ExtraColumn'])


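def clean_names(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    # Fill missing names, then strip anything that is not a letter or a space
    df['Name'] = df['Name'].fillna('Unknown')
    df['Name'] = df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
    return df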


def clean_data(df):
    df = drop_extra_column(df)
    df = clean_names(df)
    return df


if __name__ == "__main__":
    raw_df = create_sample_data()
    print("Raw Data:\n", raw_df)

    cleaned_df = clean_data(raw_df)
    print("\nCleaned Data:\n", cleaned_df)


Raw Data:
       Name  Age         City ExtraColumn
0    Alice   25     New York           X
1      Bob   30  Los Angeles           Y
2  Charlie   35      Chicago           Z
3    D@vid   40      Houston           W
4      NaN   28        Miami           V

Cleaned Data:
       Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3     Dvid   40      Houston
4  Unknown   28        Miami


# Recap

That is technically fine. We can import the individual functions into another Python file or notebook and run them.

We would do the following, where `file_name` stands in for the name of the `.py` file without its extension:
```python
from file_name import drop_extra_column
```

We would repeat that import for each function we need.

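We could also do the following, but should avoid it:
```python
from file_name import *
```

This brings in every public name from the file at once, but it can silently overwrite names you have already defined and makes it hard to tell where a function came from.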
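
As a rough illustration of the kind of problem a wildcard import can cause, the sketch below uses two standard-library modules, `math` and `cmath`, rather than our own file. The helper `star_import_names` is a hypothetical name introduced here; it lists what `from module import *` would bring in, so we can see which names the two modules share.

```python
import cmath
import math


def star_import_names(module):
    """Names that `from module import *` would bring in."""
    # A module can list its exports in __all__; otherwise every name
    # that does not start with an underscore is imported.
    exported = getattr(module, '__all__', None)
    if exported is not None:
        return set(exported)
    return {name for name in dir(module) if not name.startswith('_')}


# Names exported by both modules: with two wildcard imports, whichever
# import runs last would silently win for each of these.
clashes = star_import_names(math) & star_import_names(cmath)
print(sorted(clashes))

# Keeping the module name in front removes the ambiguity.
print(math.sqrt(4))   # float result
print(cmath.sqrt(4))  # complex result
```

Importing only the names you need, or importing the module and using dot notation, keeps it obvious which `sqrt` you are calling.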

## This is how it would be made into a class

Here is the same cleaning logic written as a class.

In [10]:
class DataCleaner:
    def __init__(self, df):
        self.df = df.copy()

    def drop_extra_column(self):
        if 'ExtraColumn' in self.df.columns:
            self.df = self.df.drop(columns=['ExtraColumn'])
        return self.df

    def clean_names(self):
        self.df['Name'] = self.df['Name'].fillna('Unknown')
        self.df['Name'] = self.df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
        return self.df

    def clean(self):
        self.drop_extra_column()
        self.clean_names()
        return self.df


In [11]:
# making a dataframe to use the class on
class_clean = create_sample_data()
class_clean

Unnamed: 0,Name,Age,City,ExtraColumn
0,Alice,25,New York,X
1,Bob,30,Los Angeles,Y
2,Charlie,35,Chicago,Z
3,D@vid,40,Houston,W
4,,28,Miami,V


In the next cell:

- The first line instantiates the class (that’s just a fancy way of saying we’re creating an object from the class so we can use it).
- The second line calls the `clean` method on the `cleaner` object using dot notation. (A function that lives inside a class is called a method.)

In [12]:
cleaner = DataCleaner(class_clean)
class_clean = cleaner.clean()
class_clean

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago
3,Dvid,40,Houston
4,Unknown,28,Miami


With this example we have to create an instance of the class, then call the method we want to use from that instance.

### We can call `clean()` inside `__init__()` so the cleaning runs automatically when the object is created

In [13]:
class AutoDataCleaner:
    def __init__(self, df):
        self.df = df.copy()
        self.clean()

    def drop_extra_column(self):
        if 'ExtraColumn' in self.df.columns:
            self.df = self.df.drop(columns=['ExtraColumn'])
        return self.df

    def clean_names(self):
        self.df['Name'] = self.df['Name'].fillna('Unknown')
        self.df['Name'] = self.df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
        return self.df

    def clean(self):
        self.drop_extra_column()
        self.clean_names()
        return self.df

In [14]:
auto_clean = create_sample_data()
print(f'This is the raw data \n \n {auto_clean}\n \n')

auto_clean = AutoDataCleaner(auto_clean)
print(f'This is the cleaned data \n\n {auto_clean.df}')

This is the raw data 
 
       Name  Age         City ExtraColumn
0    Alice   25     New York           X
1      Bob   30  Los Angeles           Y
2  Charlie   35      Chicago           Z
3    D@vid   40      Houston           W
4      NaN   28        Miami           V
 

This is the cleaned data 

       Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3     Dvid   40      Houston
4  Unknown   28        Miami


To call each method on its own, we have to take the `self.clean()` call back out of `__init__()`, like it was originally. It is commented out below.

In [15]:
class DataCleaner:
    def __init__(self, df):
        self.df = df.copy()
        # self.clean()

    def drop_extra_column(self):
        if 'ExtraColumn' in self.df.columns:
            self.df = self.df.drop(columns=['ExtraColumn'])
        return self.df

    def clean_names(self):
        self.df['Name'] = self.df['Name'].fillna('Unknown')
        self.df['Name'] = self.df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
        return self.df

    def clean(self):
        self.drop_extra_column()
        self.clean_names()
        return self.df
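
Editing the class each time is one way to switch between the two behaviours. Another option, sketched here purely as an assumption and not something the notebook itself does, is an optional flag on `__init__()`. The class name `FlexibleDataCleaner` and the `auto_clean` parameter are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd


class FlexibleDataCleaner:
    """Hypothetical variant: cleaning on creation is opt-in."""

    def __init__(self, df, auto_clean=False):
        self.df = df.copy()
        if auto_clean:  # assumption: off by default, like the original class
            self.clean()

    def drop_extra_column(self):
        if 'ExtraColumn' in self.df.columns:
            self.df = self.df.drop(columns=['ExtraColumn'])
        return self.df

    def clean_names(self):
        self.df['Name'] = self.df['Name'].fillna('Unknown')
        self.df['Name'] = self.df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
        return self.df

    def clean(self):
        self.drop_extra_column()
        self.clean_names()
        return self.df


# Small made-up frame with the same column names as the sample data.
raw = pd.DataFrame({
    'Name': ['D@vid', np.nan],
    'ExtraColumn': ['W', 'V'],
})

manual = FlexibleDataCleaner(raw)                 # nothing cleaned yet
auto = FlexibleDataCleaner(raw, auto_clean=True)  # cleaned inside __init__

print(manual.df.columns.tolist())
print(auto.df.columns.tolist())
```

With the flag left at its default, the object behaves like `DataCleaner`; with `auto_clean=True` it behaves like `AutoDataCleaner`.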


In [16]:
class_clean = create_sample_data()
class_clean

Unnamed: 0,Name,Age,City,ExtraColumn
0,Alice,25,New York,X
1,Bob,30,Los Angeles,Y
2,Charlie,35,Chicago,Z
3,D@vid,40,Houston,W
4,,28,Miami,V


In [17]:
cleaner = DataCleaner(class_clean)
cleaner.df

Unnamed: 0,Name,Age,City,ExtraColumn
0,Alice,25,New York,X
1,Bob,30,Los Angeles,Y
2,Charlie,35,Chicago,Z
3,D@vid,40,Houston,W
4,,28,Miami,V


## Step-by-Step Breakdown:
### 1. DataCleaner(class_clean)
This creates a new instance of the `DataCleaner` class.

- It takes in a DataFrame (`class_clean`) and passes it to the class's `__init__()` method.

- Inside `__init__()`, it makes a copy of that DataFrame and saves it to `self.df`.

- So now you have a `DataCleaner` object that has a copy of your original data inside it.
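
To see why the copy matters, here is a small sketch with two stripped-down classes. `CopyingCleaner` copies the data the way `DataCleaner` does, while `SharingCleaner` is a hypothetical counter-example that stores the caller's DataFrame directly. Both class names and the tiny sample frame are made up for this illustration.

```python
import numpy as np
import pandas as pd


class CopyingCleaner:
    """Stores a copy of the data, as DataCleaner does."""

    def __init__(self, df):
        self.df = df.copy()

    def clean_names(self):
        self.df['Name'] = self.df['Name'].fillna('Unknown')
        return self.df


class SharingCleaner(CopyingCleaner):
    """Hypothetical counter-example: keeps the caller's DataFrame itself."""

    def __init__(self, df):
        self.df = df  # no copy, so self.df and the caller's frame are one object


original = pd.DataFrame({'Name': ['Alice', np.nan]})

CopyingCleaner(original).clean_names()
missing_after_copy = int(original['Name'].isna().sum())
print(missing_after_copy)   # the missing value is still in the original

SharingCleaner(original).clean_names()
missing_after_share = int(original['Name'].isna().sum())
print(missing_after_share)  # the original itself has now been changed
```

Because `DataCleaner` copies in `__init__()`, the raw data stays available for comparison after cleaning.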


### 2. cleaner = ...
This stores the new `DataCleaner` object in the variable called `cleaner`.

### 3. cleaner.df
This accesses the `.df` attribute of the `cleaner` object, which is the DataFrame stored inside the class. No cleaning method has been called yet, so at this point it still matches the original data.

## What This Lets You Do:
- Now you can use methods like `cleaner.drop_extra_column()` or `cleaner.clean_names()`.

- Or, after doing those, just look at the data by typing `cleaner.df`.

## In Simple Terms:
You're making a helper object (`cleaner`) that knows how to clean your data, and then checking the current state of that data inside it using `.df`.

In [18]:
# now we can call methods on the cleaner object
cleaned_df = cleaner.clean()
cleaned_df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago
3,Dvid,40,Houston
4,Unknown,28,Miami


# Using only the column drop

Calling a single method from the class. We will assume we only want to drop the extra column.

In [19]:
col_test = create_sample_data()
col_test

Unnamed: 0,Name,Age,City,ExtraColumn
0,Alice,25,New York,X
1,Bob,30,Los Angeles,Y
2,Charlie,35,Chicago,Z
3,D@vid,40,Houston,W
4,,28,Miami,V


The first line instantiates the class. That’s just a fancy way of saying we’re creating an object from the class so we can use it. In the second line, we’re calling the `drop_extra_column` method on the `cleaner` object using dot notation.

In [20]:
cleaner = DataCleaner(col_test)
col_test = cleaner.drop_extra_column()
col_test

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago
3,D@vid,40,Houston
4,,28,Miami


The benefit of writing code with classes instead of standalone functions lies in the flexibility and organization classes provide. A class bundles related functions (methods) together with the data they operate on (attributes), which makes the code more modular and easier to manage.

For example, you can call `clean()` to run the complete cleaning process, or call individual methods such as `drop_extra_column()` to perform specific tasks as needed. This makes the code reusable, flexible, and better organized.


# The final boss!

Adding docstrings and type hints!

In [21]:
import pandas as pd

class DataCleaner:
    """
    A class for cleaning a pandas DataFrame by removing extra columns
    and cleaning values in the 'Name' column.
    """

    def __init__(self, df: pd.DataFrame):
        """
        Initialize the DataCleaner with a DataFrame.

        Args:
            df (pd.DataFrame): The input DataFrame to be cleaned.
        """
        self.df: pd.DataFrame = df.copy()

    def drop_extra_column(self) -> pd.DataFrame:
        """
        Drops the 'ExtraColumn' from the DataFrame if it exists.

        Returns:
            pd.DataFrame: The updated DataFrame without 'ExtraColumn'.
        """
        if 'ExtraColumn' in self.df.columns:
            self.df = self.df.drop(columns=['ExtraColumn'])
        return self.df

    def clean_names(self) -> pd.DataFrame:
        """
        Cleans the 'Name' column by:
        - Filling missing values with 'Unknown'
        - Removing any characters that are not letters or spaces

        Returns:
            pd.DataFrame: The updated DataFrame with cleaned 'Name' values.
        """
        self.df['Name'] = self.df['Name'].fillna('Unknown')
        self.df['Name'] = self.df['Name'].str.replace(r'[^a-zA-Z ]', '', regex=True)
        return self.df

    def clean(self) -> pd.DataFrame:
        """
        Applies all cleaning steps:
        - Drops 'ExtraColumn'
        - Cleans the 'Name' column

        Returns:
            pd.DataFrame: The fully cleaned DataFrame.
        """
        self.drop_extra_column()
        self.clean_names()
        return self.df


In [22]:
DataCleaner?

Init signature: DataCleaner(df: pandas.core.frame.DataFrame)
Docstring:
A class for cleaning a pandas DataFrame by removing extra columns
and cleaning values in the 'Name' column.
Init docstring:
Initialize the DataCleaner with a DataFrame.

Args:
    df (pd.DataFrame): The input DataFrame to be cleaned.
Type:           type
Subclasses:

In [23]:
df = create_sample_data()
cleaner = DataCleaner(df)

In [24]:
cleaner.drop_extra_column?

Signature: cleaner.drop_extra_column() -> pandas.core.frame.DataFrame
Docstring:
Drops the 'ExtraColumn' from the DataFrame if it exists.

Returns:
    pd.DataFrame: The updated DataFrame without 'ExtraColumn'.
File:      /var/folders/2d/yt4_w6zn5pbfjg_jx5sdmm180000gn/T/ipykernel_67974/59002941.py
Type:      method

## What This Code Did:
By adding docstrings (the triple-quoted text under each method and class) and type hints, you made your code self-documenting.

## What "self-documenting" means:
- You don’t need to read the source code to understand what each method does.
- Anyone (including future-you) can:
  - Hover over methods in an IDE (like VS Code)
  - Use `?` in a notebook (e.g., `cleaner.drop_extra_column?`)
- And instantly see:
  - What the method does
  - What arguments it expects
  - What type of data it returns

## Example:
When you type:

```python
cleaner.drop_extra_column?
```

You now see something like:

```text
Signature: cleaner.drop_extra_column() -> pandas.core.frame.DataFrame
Docstring:
Drops the 'ExtraColumn' from the DataFrame if it exists.

Returns:
    pd.DataFrame: The updated DataFrame without 'ExtraColumn'.
```

Without even opening the file or reading the code, this tells you:
- What the method does
- That it doesn’t need any arguments
- That it gives you back a cleaned DataFrame
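
The `?` shortcut is an IPython/Jupyter feature. In a plain Python script, the same information is available through the built-in `help()` function or the standard-library `inspect` module. The sketch below uses a trimmed-down copy of `DataCleaner`, shortened only to keep the example small, and reads the docstring and signature programmatically.

```python
import inspect

import pandas as pd


class DataCleaner:
    """Trimmed-down copy of the class above, kept short for the demo."""

    def __init__(self, df: pd.DataFrame):
        self.df: pd.DataFrame = df.copy()

    def drop_extra_column(self) -> pd.DataFrame:
        """Drops the 'ExtraColumn' from the DataFrame if it exists."""
        if 'ExtraColumn' in self.df.columns:
            self.df = self.df.drop(columns=['ExtraColumn'])
        return self.df


method = DataCleaner.drop_extra_column

doc = inspect.getdoc(method)           # the docstring text
signature = inspect.signature(method)  # parameters and return annotation

print(doc)
print(signature)
print(signature.return_annotation is pd.DataFrame)
```

Calling `help(DataCleaner)` prints a similar summary for the whole class in one go.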


## Why It’s Useful:

- Great for teamwork
- Others can easily use your class without asking questions
- Helps in debugging and testing... you know what to expect
- Makes code more professional and maintainable
- Some tools (like Sphinx or MkDocs) can even turn these docstrings into full documentation websites

