# Keep your code clean

Jupyter notebooks often start as exploration and experimentation by an individual data scientist. The code is not meant to be shared, and certainly not meant to run beyond the early days of a project. However, snippets of code from Jupyter notebooks often make their way to production servers, where they run for months, possibly years.

Software engineers have developed best practices for making code more readable. Although the basic principles are shared, languages also develop their own culture and aesthetics. For example, Java names variables in camel case, such as `studentAge` and `dateOfBirth`. C# favors Pascal case, which also capitalizes the first letter, such as `StudentAge` and `DateOfBirth`. Lisp uses kebab case, such as `student-age` and `date-of-birth`. Our favorite language, Python, uses snake case for variables: `student_age`, `date_of_birth`.

Languages often use a different convention for class names. Python uses Pascal case for classes: `TextClassifier`, `ConsoleLogger`.
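To make the naming styles above concrete, here is a minimal sketch that converts camel case, Pascal case, and kebab case names into Python's snake case. The helper name `to_snake_case` is hypothetical and not part of any library; it only handles simple names like the ones in the examples above.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase, PascalCase, or kebab-case name to snake_case."""
    # Treat kebab-case hyphens as word separators.
    name = name.replace("-", "_")
    # Insert an underscore wherever a lowercase letter or digit
    # is followed directly by an uppercase letter.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


for original in ["studentAge", "DateOfBirth", "student-age", "date_of_birth"]:
    print(original, "->", to_snake_case(original))
```

All four inputs end up in the same style, which is the point of a convention: a reader never has to guess how a name is spelled.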

**NOTE**: The examples below are guidelines. There are often good reasons to deviate from best practices. Do not feel beholden to the rules below. As with any culture, some norms are barely enforced and some norms will cast you out.

### Names should be meaningful

**Variables**

```python
# Bad
x = np.mean(data)
df2 = process(df1)
lst = [2, 3, 5]

# Good
average_score = np.mean(student_scores)
processed_dataframe = process(raw_dataframe)
prime_numbers = [2, 3, 5]
```

Notice that we can make this even easier to read by aligning the equals signs. Be aware that PEP 8 discourages the extra spaces, and formatters such as `black` will remove them:
```python
average_score       = np.mean(student_scores)
processed_dataframe = process(raw_dataframe)
prime_numbers       = [2, 3, 5]
```


**Functions**

```python
# Bad
def proc_data():
    pass

# Good
def preprocess_text_data(raw_text):
    pass
```

**Classes**

```python
# Bad
class Mdl:
    pass

# Good
class TextClassifier:
    pass
```

### Imports should be organized by functionality and source

```python
# Standard library
import os
from typing import Dict, List

# Third-party
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Local
from src.data.loading import load_dataset
from src.features.engineering import create_features
```

### Functions should be documented using docstrings

```python
def calculate_feature_importance(model, feature_names):
    """Calculate feature importance scores from a trained random forest model.

    Args:
        model: Trained random forest classifier
        feature_names: List of feature names corresponding to model features

    Returns:
        Dictionary mapping feature names to their importance scores

    Raises:
        ValueError: If length of feature_names doesn't match model features
    """
    pass
```

Because the function has a docstring, we can now ask for help on it:

```python
help(calculate_feature_importance)
```

```text
Help on function calculate_feature_importance in module __main__:

calculate_feature_importance(model, feature_names)
    Calculate feature importance scores from a trained random forest model.

    Args:
        model: Trained random forest classifier
        feature_names: List of feature names corresponding to model features

    Returns:
        Dictionary mapping feature names to their importance scores

    Raises:
        ValueError: If length of feature_names doesn't match model features
```

In Jupyter, the same documentation appears in a tooltip when you place the cursor inside the parentheses and press SHIFT+TAB. Running the call without arguments raises an error, because both parameters are required:

```python
calculate_feature_importance()  # place cursor inside the parentheses and press SHIFT+TAB
```

```text
TypeError: calculate_feature_importance() missing 2 required positional arguments: 'model' and 'feature_names'
```
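The docstring above promises a return value and a `ValueError`, so it helps to see a body that keeps those promises. The following is only a sketch of one possible implementation: it assumes the model exposes a scikit-learn-style `feature_importances_` attribute, and it uses a `SimpleNamespace` stand-in (`fake_model`, a hypothetical name) instead of a real trained classifier so that the example runs without any third-party packages.

```python
from types import SimpleNamespace


def calculate_feature_importance(model, feature_names):
    """Calculate feature importance scores from a trained random forest model.

    Args:
        model: Trained random forest classifier
        feature_names: List of feature names corresponding to model features

    Returns:
        Dictionary mapping feature names to their importance scores

    Raises:
        ValueError: If length of feature_names doesn't match model features
    """
    # Assumption: the model stores one score per feature in `feature_importances_`.
    importances = list(model.feature_importances_)
    if len(feature_names) != len(importances):
        raise ValueError(
            f"Expected {len(importances)} feature names, got {len(feature_names)}"
        )
    return {name: float(score) for name, score in zip(feature_names, importances)}


# Stand-in for a trained model, used only to exercise the function.
fake_model = SimpleNamespace(feature_importances_=[0.7, 0.3])
print(calculate_feature_importance(fake_model, ["age", "salary"]))

# The docstring is stored on the function itself, which is what help() reads.
print(calculate_feature_importance.__doc__.splitlines()[0])
```

Writing the docstring first, as the lecture does, gives you a checklist for the implementation: every entry under `Returns` and `Raises` should correspond to a line of code.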

**Explain the code appropriately**

```python
# Bad: Explains what the code does
# Loop through the dataframe
for idx, row in df.iterrows():
    processed.append(row)

# Good: Explains why the code is needed
# Handle missing values before modeling to prevent training errors
df.fillna(df.mean(), inplace=True)
```

### Functions should not be too large and do one thing

```python
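from typing import Dict

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split

# Bad: one function that does everything
def process_and_train_and_evaluate():
    # Load data
    # Process data
    # Train model
    # Evaluate model
    # Save results
    pass

# Good: each function does one thing
def load_data(filepath: str) -> pd.DataFrame:
    pass

def preprocess_features(df: pd.DataFrame) -> pd.DataFrame:
    pass

def train_model(X: np.ndarray, y: np.ndarray) -> BaseEstimator:
    pass

def evaluate_model(model: BaseEstimator, X_test: np.ndarray, y_test: np.ndarray) -> Dict[str, float]:
    pass

# A short top-level function wires the steps together.
# `filepath` and `target_column` are illustrative parameters.
def process_and_train_and_evaluate(filepath: str, target_column: str) -> Dict[str, float]:
    df = preprocess_features(load_data(filepath))
    X = df.drop(columns=[target_column]).to_numpy()
    y = df[target_column].to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = train_model(X_train, y_train)
    return evaluate_model(model, X_test, y_test)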
```
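One practical benefit of small, single-purpose functions is that each one can be checked on its own. The sketch below illustrates this with a deliberately tiny pipeline built from plain Python lists; the function names (`parse_scores`, `drop_missing`, `average`, `summarize`) are hypothetical and chosen only for this example.

```python
from typing import List, Optional


def parse_scores(raw: str) -> List[Optional[float]]:
    """Turn a comma-separated string into numbers; blank entries become None."""
    return [float(item) if item.strip() else None for item in raw.split(",")]


def drop_missing(scores: List[Optional[float]]) -> List[float]:
    """Remove the None placeholders left by blank entries."""
    return [score for score in scores if score is not None]


def average(scores: List[float]) -> float:
    """Return the arithmetic mean, refusing to divide by zero."""
    if not scores:
        raise ValueError("cannot average an empty list")
    return sum(scores) / len(scores)


def summarize(raw: str) -> float:
    """Wire the three steps together."""
    return average(drop_missing(parse_scores(raw)))


print(summarize("80, 90, , 100"))  # 90.0
```

If `summarize` ever returns a surprising number, you can call `parse_scores` or `drop_missing` directly on the same input and see which step went wrong.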

**NOTE**: We will learn more about these conventions, and about additional ones such as type annotations, throughout this course.

### Automated tools
Many tools can help you format and check your code. Two that matter for this lecture are `pylint`, which checks code for errors and style problems, and `black`, which reformats code automatically.

```python
%%writefile messy.py
# messy.py
import pandas as pd,numpy as np
from typing import List,Dict,Any
import matplotlib.pyplot as plt

class dataProcessor:
    def __init__(self,input_file:str,    output_file:str='processed.csv'):
        self.input=input_file
        self.output_file=output_file
        self.data=None
    
    def Load_data(self):
        """loads data from csv file"""
        self.data=pd.read_csv(self.input)
        return self.data
    
    def process(self,columns_to_process:List[str]=[],aggfunc:str='mean')->pd.DataFrame:
        if len(columns_to_process)==0: return self.data
        processed_data={}
        for col in columns_to_process:
         if col in self.data.columns:
          processed_data[col]=getattr(self.data[col],aggfunc)()
         else:
            print(f"Warning: Column {col} not found")
        return pd.DataFrame(processed_data,index=[0])

    def visualize_data(self,   column:str,   PlotType:str='bar'   )->None:
        if self.data is None:raise ValueError('No data loaded')
        plt.figure(figsize=(10,     5))
        if PlotType=='bar':
            self.data[column].value_counts().plot(kind='bar')
        elif     PlotType=='hist':
            self.data[column].hist()
        plt.title(f'Visualization of {column}')
        plt.show()

def main():
    processor=dataProcessor('data.csv')
    df = processor.Load_data()
    processed=processor.process(['age','salary'],aggfunc='mean')
    processor.visualize_data('age','hist')

if __name__=='__main__':
    main()
```


To match the line numbers reported by the tools below against the program above, turn on line numbers in Jupyter from View->Show Line Numbers.

You may have to install these tools:

`pip install pylint black`

```python
!pylint messy.py
```

```text
************* Module messy
messy.py:11:0: C0303: Trailing whitespace (trailing-whitespace)
messy.py:16:0: C0303: Trailing whitespace (trailing-whitespace)
messy.py:21:0: W0311: Bad indentation. Found 9 spaces, expected 12 (bad-indentation)
messy.py:22:0: W0311: Bad indentation. Found 10 spaces, expected 16 (bad-indentation)
messy.py:23:0: W0311: Bad indentation. Found 9 spaces, expected 12 (bad-indentation)
messy.py:24:0: W0311: Bad indentation. Found 12 spaces, expected 16 (bad-indentation)
messy.py:1:0: C0114: Missing module docstring (missing-module-docstring)
messy.py:2:0: C0410: Multiple imports on one line (pandas, numpy) (multiple-imports)
messy.py:1:0: F0002: messy.py: Fatal error while checking 'messy.py'. Please open an issue in our bug tracker so we address this. There is a pre-filled template that you can use in 'C:\Users\shahb\AppData\Local\pylint\pylint\Cache\pylint-crash-2025-01-07-02-16-46.txt'. (astroid-error)
```

Each message has the form `file:line:column: code: description (symbolic-name)`. The final `F0002` line is not a style finding: in this run pylint itself crashed while analyzing the `from typing import ...` statement on line 3 of `messy.py`, so the report stops there. The messages before it are still valid.
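The pylint report above stops before it reaches the naming problems in `messy.py`, such as the class `dataProcessor` and the method `Load_data`. To show the kind of check a linter automates, here is a toy sketch built on Python's standard `ast` module. It is not how pylint is implemented; the function `find_naming_issues` and the two regular expressions are assumptions made for this illustration, and they cover only class, function, and argument names.

```python
import ast
import re
from typing import List

# Simplified patterns for the two Python naming styles discussed earlier.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")
PASCAL_CASE = re.compile(r"^_?[A-Z][A-Za-z0-9]*$")


def find_naming_issues(source: str) -> List[str]:
    """Return one message per class, function, or argument name that breaks the convention."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            if not PASCAL_CASE.match(node.name):
                found.append((node.lineno, f"class '{node.name}' should be PascalCase"))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                found.append((node.lineno, f"function '{node.name}' should be snake_case"))
            for arg in node.args.args:
                if not SNAKE_CASE.match(arg.arg):
                    found.append((node.lineno, f"argument '{arg.arg}' should be snake_case"))
    # Sort by line number so the report reads top to bottom.
    return [f"line {lineno}: {message}" for lineno, message in sorted(found)]


# A shortened version of the class from messy.py.
sample = (
    "class dataProcessor:\n"
    "    def Load_data(self):\n"
    "        return None\n"
    "\n"
    "    def visualize_data(self, column, PlotType='bar'):\n"
    "        return None\n"
)

for issue in find_naming_issues(sample):
    print(issue)
# line 1: class 'dataProcessor' should be PascalCase
# line 2: function 'Load_data' should be snake_case
# line 5: argument 'PlotType' should be snake_case
```

A real linter applies hundreds of checks like this one, which is why it is worth running a tool instead of relying on manual review.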

```python
!black messy.py --diff --color
```

```diff
--- messy.py	2025-01-07 08:16:42.252498+00:00
+++ messy.py	2025-01-07 08:16:58.016915+00:00
@@ -1,44 +1,51 @@
 # messy.py
-import pandas as pd,numpy as np
-from typing import List,Dict,Any
+import pandas as pd, numpy as np
+from typing import List, Dict, Any
 import matplotlib.pyplot as plt
 
+
 class dataProcessor:
-    def __init__(self,input_file:str,    output_file:str='processed.csv'):
-        self.input=input_file
-        self.output_file=output_file
-        self.data=None
-    
+    def __init__(self, input_file: str, output_file: str = "processed.csv"):
+        self.input = input_file
+        self.output_file = output_file
+        self.data = None
+
     def Load_data(self):
         """loads data from csv file"""
-        self.data=pd.read_csv(self.input)
+        self.data = pd.read_csv(self.input)
```

```text
would reformat messy.py

All done! ✨ 🍰 ✨
1 file would be reformatted.
```
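Every change in the diff above affects only spacing, blank lines, and quote style, so the program behaves the same before and after formatting. One way to convince yourself is to compare the syntax trees of the two versions with the standard `ast` module. The sketch below does that for single lines taken from the diff; `same_syntax_tree` is a hypothetical helper written for this example. As far as I know, `black` performs a similar equivalence check by default before writing a file.

```python
import ast


def same_syntax_tree(before: str, after: str) -> bool:
    """Return True when both snippets parse to the same syntax tree."""
    # ast.dump omits line and column positions by default,
    # so pure layout changes do not affect the comparison.
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))


messy = "self.output_file='processed.csv'"
formatted = 'self.output_file = "processed.csv"'

print(same_syntax_tree(messy, formatted))                         # True
print(same_syntax_tree(messy, 'self.output_file = "other.csv"'))  # False
```

The first comparison differs only in spacing and quotes, while the second changes the string value itself, which is a change in behavior that a formatter should never make.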
