## Refactoring

#### 1. Fixing Duplicated Code by Pulling Up a Function

Bad Code Smell: Duplicated Code
Refactoring Motif: Pull Up Function

Imagine you are trying to compare the performance of AdaBoost versus RandomForest on the famous Kaggle Titanic competition:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier


def load_df():
    df = pd.read_csv('../data/titanic_dataset.csv')
    df["is_male"] = df["Sex"] == "male"
    return df


def evaluate_adaboost_model(data):
    X = data[["is_male", "SibSp", "Pclass", "Fare"]]
    y = data['Survived']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    model = AdaBoostClassifier()
    model.fit(X_train, y_train)
    
    accuracy = model.score(X_test, y_test)
    return accuracy


def evaluate_random_forest_model(data):
    X = data[["is_male", "SibSp", "Pclass", "Fare"]]
    y = data['Survived']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    
    accuracy = model.score(X_test, y_test)
    return accuracy


df = load_df()

accuracy_adaboost = evaluate_adaboost_model(df)
print(f"AdaBoost Accuracy: {accuracy_adaboost:.3f}")

accuracy_random_forest = evaluate_random_forest_model(df)
print(f"Random Forest Accuracy: {accuracy_random_forest:.3f}")


And now, you want to add an additional comparison for XGBoost before handing it off to some engineers to run it on a much larger dataset. But first, you smell something is amiss… are you really going to create a new function, `evaluate_xgboost_model()`? `evaluate_adaboost_model()` and `evaluate_random_forest_model()` already appear to have a lot of repeated code that you’d have to copy and paste.

Well, if you find yourself needing to copy and paste code, it’s usually a hint to refactor. To fix this case of duplicated code, we’ll use the “pull up function” refactoring, in which we extract a generalized version of two nearly identical functions. More specifically, we’ll pull up a function `evaluate_model()`, which takes the model of interest as an argument and makes `evaluate_adaboost_model()` and `evaluate_random_forest_model()` redundant:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier


def load_df():
    df = pd.read_csv('../data/titanic_dataset.csv')
    df["is_male"] = df["Sex"] == "male"
    return df


#### PULLED UP FUNCTION ####
def evaluate_model(data, model_constructor):
    X = data[["is_male", "SibSp", "Pclass", "Fare"]]
    y = data['Survived']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    model = model_constructor()
    model.fit(X_train, y_train)
    
    accuracy = model.score(X_test, y_test)
    return accuracy


df = load_df()

accuracy_adaboost = evaluate_model(df, AdaBoostClassifier)
print(f"AdaBoost Accuracy: {accuracy_adaboost:.3f}")

accuracy_random_forest = evaluate_model(df, RandomForestClassifier)
print(f"Random Forest Accuracy: {accuracy_random_forest:.3f}")


Instead of immediately putting on the “add features” hat, we took a moment, and put on the “refactor hat”. Without changing the observable behavior of the code, we managed to improve its internal design in such a way that not only did our code get shorter, but adding an extra model-evaluation for XGBoost (and any other model) is now trivial:

In [None]:
from xgboost import XGBClassifier


accuracy_xgboost = evaluate_model(df, XGBClassifier)
print(f"XGBoost Accuracy: {accuracy_xgboost:.3f}")
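
Because `evaluate_model()` only ever calls `model_constructor()` with no arguments, any zero-argument callable should work, not just a class. The sketch below shows one way to pass hyperparameters through that argument with `functools.partial`. Note the assumptions: `ThresholdModel` is a hypothetical stand-in that only mimics the `fit()`/`score()` interface, and `evaluate_model_sketch()` is a simplified variant without a train/test split, so the example runs without scikit-learn or the Titanic data.

```python
from functools import partial


class ThresholdModel:
    """Hypothetical toy model exposing fit() and score(), for illustration only."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def score(self, X, y):
        predictions = [x > self.threshold for x in X]
        correct = sum(p == t for p, t in zip(predictions, y))
        return correct / len(y)


def evaluate_model_sketch(X, y, model_constructor):
    # Simplified on purpose: fits and scores on the same data
    model = model_constructor()
    model.fit(X, y)
    return model.score(X, y)


X = [0.1, 0.4, 0.6, 0.9]
y = [False, False, True, True]

# A class is itself a zero-argument callable when all parameters have defaults
default_accuracy = evaluate_model_sketch(X, y, ThresholdModel)

# partial() pre-binds a hyperparameter and still yields a zero-argument callable
strict_accuracy = evaluate_model_sketch(X, y, partial(ThresholdModel, threshold=0.8))

print(default_accuracy)  # 1.0
print(strict_accuracy)   # 0.75
```

With scikit-learn installed, the same pattern would presumably read `evaluate_model(df, partial(RandomForestClassifier, n_estimators=200))`.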


#### 2. Fixing a Mysterious Name by Renaming a Variable
Bad Code Smell: Mysterious Name
Refactoring Motif: Rename Variable

Now, let’s say you got good results (yay!), so you show them to your colleague, who immediately interrupts you by asking, “What’s actually in `df`, though?”

Often, renaming a variable has the highest ratio of reward-to-effort of any refactoring motif; with a little bit of thought and just a few keystrokes, you can preemptively answer questions about the data that’s getting passed around your code. In this case, it should be easy to come up with a clear name that communicates precisely what’s inside of our dataframe; doing so will not only answer your colleague’s question, but it will preempt any other such questions:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier


#### RENAMED FUNCTION ####
def load_titanic_passengers_df():
    titanic_passengers_df = pd.read_csv('../data/titanic_dataset.csv')
    titanic_passengers_df["is_male"] = titanic_passengers_df["Sex"] == "male"
    return titanic_passengers_df


def evaluate_model(data, model_constructor):
    X = data[["is_male", "SibSp", "Pclass", "Fare"]]
    y = data['Survived']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    model = model_constructor()
    model.fit(X_train, y_train)
    
    accuracy = model.score(X_test, y_test)
    return accuracy


#### RENAMED VARIABLE ####
titanic_passengers_df = load_titanic_passengers_df()

accuracy_adaboost = evaluate_model(titanic_passengers_df, AdaBoostClassifier)
print(f"AdaBoost Accuracy: {accuracy_adaboost:.3f}")

accuracy_random_forest = evaluate_model(titanic_passengers_df, RandomForestClassifier)
print(f"Random Forest Accuracy: {accuracy_random_forest:.3f}")

accuracy_xgboost = evaluate_model(titanic_passengers_df, XGBClassifier)
print(f"XGBoost Accuracy: {accuracy_xgboost:.3f}")


After renaming `df` to `titanic_passengers_df` and `load_df()` to `load_titanic_passengers_df()`, there’s no doubt at all – every row in the newly named `titanic_passengers_df` represents a distinct passenger from the Titanic. Nice!

#### 3. Fixing Magic Values by Extracting Variables
Bad Code Smell: Magic Values
Refactoring Motif: Extract Variable

So you’ve discussed the results with your colleagues, and they all agree: you’ve got some great results! Now it’s time to share the code with the engineers on your team so they can scale it to more data. The only problem is that the engineers are not so familiar with data science and machine learning, and they immediately ask, “What are these string values at the top of the `evaluate_model` function?”

Instead of saying “they’re features and a target for our model, silly engineer!”, we can answer all such questions permanently by extracting the requisite variables from our code:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier


#### EXTRACTED VARIABLES ####
MODEL_FEATURES = ["is_male", "SibSp", "Pclass", "Fare"]
MODEL_TARGET = "Survived"
TRAIN_TEST_SPLIT_FRACTION = 0.2


def load_titanic_passengers_df():
    titanic_passengers_df = pd.read_csv('../data/titanic_dataset.csv')
    titanic_passengers_df["is_male"] = titanic_passengers_df["Sex"] == "male"
    return titanic_passengers_df


def evaluate_model(data, model_constructor):
    X = data[MODEL_FEATURES]
    y = data[MODEL_TARGET]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TRAIN_TEST_SPLIT_FRACTION)
    
    model = model_constructor()
    model.fit(X_train, y_train)
    
    accuracy = model.score(X_test, y_test)
    return accuracy


titanic_passengers_df = load_titanic_passengers_df()

accuracy_adaboost = evaluate_model(titanic_passengers_df, AdaBoostClassifier)
print(f"AdaBoost Accuracy: {accuracy_adaboost:.3f}")

accuracy_random_forest = evaluate_model(titanic_passengers_df, RandomForestClassifier)
print(f"Random Forest Accuracy: {accuracy_random_forest:.3f}")

accuracy_xgboost = evaluate_model(titanic_passengers_df, XGBClassifier)
print(f"XGBoost Accuracy: {accuracy_xgboost:.3f}")


With the `MODEL_FEATURES` and `MODEL_TARGET` variables extracted, we significantly decrease ambiguity for future readers. We also extracted `TRAIN_TEST_SPLIT_FRACTION`, the share of rows held out for testing (it is passed as `test_size`), in case of future questions about that as well.
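
To make the meaning of a fraction like `0.2` concrete, here is a minimal sketch of how it translates into row counts. `split_sizes()` is a hypothetical helper, not part of scikit-learn, and it assumes the held-out count is rounded up; check the behavior of your scikit-learn version if the exact counts matter.

```python
import math

TRAIN_TEST_SPLIT_FRACTION = 0.2  # share of rows held out for testing


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test) for a dataset with n_rows rows (hypothetical helper)."""
    n_test = math.ceil(n_rows * test_fraction)  # assumption: test count rounds up
    n_train = n_rows - n_test
    return n_train, n_test


print(split_sizes(10, TRAIN_TEST_SPLIT_FRACTION))  # (8, 2)
```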

In summary, we:

1. Fixed duplicated code by pulling up the `evaluate_model()` function;
2. Fixed the mysterious `df` name by renaming it to `titanic_passengers_df`; and
3. Fixed the magic values in our code by extracting the variables `MODEL_FEATURES`, `MODEL_TARGET`, and `TRAIN_TEST_SPLIT_FRACTION`.

Not only was adding that extra evaluation for XGBoost easier, but our code has also become more readable for everyone who is likely to interact with it.

There’s still plenty more refactoring that can be done to this code, but for now, it’s enough. Once it’s time to add the next piece of new functionality to the code, then the next most important refactoring will reveal itself.

### Using the dataclasses Module

In Python, a data class is a class designed primarily to hold data values. Data classes are ordinary classes, but they usually have little behavior of their own beyond storing and exposing their fields. They are typically used to store information that is passed between different parts of a program or system.

However, when a class serves only as a data container, writing `__init__` (and often `__repr__`, `__eq__`, and the ordering methods) by hand is repetitive and error-prone.

The `dataclasses` module, introduced in Python 3.7, generates these methods for you from the class's field annotations. For comparison, here is what a hand-written version looks like:

In [4]:
class ManualComment:
    def __init__(self, id: int, text: str):
        # Store the values privately so the read-only properties below don't clash with them
        self._id: int = id
        self._text: str = text

    @property
    def id(self) -> int:
        return self._id

    @property
    def text(self) -> str:
        return self._text

    def __repr__(self):
        return "{}(id={!r}, text={!r})".format(self.__class__.__name__, self.id, self.text)

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return (self.id, self.text) == (other.id, other.text)
        else:
            return NotImplemented

    def __ne__(self, other):
        result = self.__eq__(other)
        if result is NotImplemented:
            return NotImplemented
        else:
            return not result

    def __hash__(self):
        return hash((self.__class__, self.id, self.text))

    def __lt__(self, other):
        if other.__class__ is self.__class__:
            return (self.id, self.text) < (other.id, other.text)
        else:
            return NotImplemented

    def __le__(self, other):
        if other.__class__ is self.__class__:
            return (self.id, self.text) <= (other.id, other.text)
        else:
            return NotImplemented

    def __gt__(self, other):
        if other.__class__ is self.__class__:
            return (self.id, self.text) > (other.id, other.text)
        else:
            return NotImplemented

    def __ge__(self, other):
        if other.__class__ is self.__class__:
            return (self.id, self.text) >= (other.id, other.text)
        else:
            return NotImplemented

Here's how you would create the equivalent `Comment` class using a dataclass:

In [13]:
import dataclasses
from dataclasses import dataclass, field

@dataclass(frozen=True, order=True)
class Comment:
    id: int
    text: str = ""
    # replies: list[int] = field(default_factory=list, repr=False, compare=False)
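
The commented-out `replies` field hints at how `field()` customizes a single attribute. As a rough sketch (the class name `CommentWithReplies` is made up for illustration, and `list[int]` assumes Python 3.9 or newer), enabling it could behave like this:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True, order=True)
class CommentWithReplies:
    id: int
    text: str = ""
    # default_factory builds a fresh list per instance; repr=False and
    # compare=False keep the field out of __repr__, comparisons, and __hash__
    replies: list[int] = field(default_factory=list, repr=False, compare=False)


first = CommentWithReplies(1, "I just subscribed!")
second = CommentWithReplies(1, "I just subscribed!")

# frozen=True blocks rebinding the attribute, not mutating the list it points to
first.replies.append(42)

print(first)            # CommentWithReplies(id=1, text='I just subscribed!')
print(second.replies)   # []
print(first == second)  # True
```

Using `default_factory` rather than a literal `[]` default matters here: `dataclasses` rejects a mutable list as a default value, since it would otherwise be shared across instances.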

In [14]:
import inspect
from pprint import pprint

comment = Comment(1, "I just subscribed!")
# comment.id = 3  # would raise FrozenInstanceError: frozen instances are immutable
print(comment)
# print(dataclasses.astuple(comment))
# print(dataclasses.asdict(comment))

pprint(inspect.getmembers(Comment, inspect.isfunction))

Comment(id=1, text='I just subscribed!')
[('__delattr__', <function Comment.__delattr__ at 0x00000211A3CEE4D0>),
 ('__eq__', <function Comment.__eq__ at 0x00000211A3CEDE10>),
 ('__ge__', <function Comment.__ge__ at 0x00000211A3CEE320>),
 ('__gt__', <function Comment.__gt__ at 0x00000211A3CEE170>),
 ('__hash__', <function Comment.__hash__ at 0x00000211A3CEE560>),
 ('__init__', <function Comment.__init__ at 0x00000211A3CED6C0>),
 ('__le__', <function Comment.__le__ at 0x00000211A3CEE050>),
 ('__lt__', <function Comment.__lt__ at 0x00000211A3CECA60>),
 ('__repr__', <function Comment.__repr__ at 0x00000211A3CED510>),
 ('__setattr__', <function Comment.__setattr__ at 0x00000211A3CEE3B0>)]
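
To see what those generated methods buy you in practice, the following sketch exercises a few of them on the `Comment` class (redefined here so the snippet is self-contained). The second comment's text is made up for illustration.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Comment:
    id: int
    text: str = ""


first = Comment(1, "I just subscribed!")
second = Comment(2, "Great video")  # illustrative sample data

# __eq__ compares field by field; __lt__ and friends compare fields as tuples
print(first == Comment(1, "I just subscribed!"))  # True
print(first < second)                             # True

# frozen=True together with the generated __eq__ makes instances hashable
print(len({first, Comment(1, "I just subscribed!")}))  # 1

# Conversion helpers from the dataclasses module
print(dataclasses.asdict(first))   # {'id': 1, 'text': 'I just subscribed!'}
print(dataclasses.astuple(first))  # (1, 'I just subscribed!')

# Assigning to a field of a frozen instance raises FrozenInstanceError
try:
    first.id = 3
except dataclasses.FrozenInstanceError as error:
    print(f"Cannot assign: {error}")

# replace() builds a modified copy instead of mutating the original
edited = dataclasses.replace(first, text="Edited")
print(edited)  # Comment(id=1, text='Edited')
```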
