1. **Data Processing Pipelines**

    Data work usually follows a fixed sequence of steps: loading, cleaning, transforming, and analyzing. A class can expose each step as a method and hold the state the steps share, such as configuration settings or intermediate data.

In [None]:
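import pandas as pd

class DataPipeline:
    def __init__(self, source):
        self.source = source  # path or file-like object accepted by pd.read_csv
        self.data = None      # populated by load_data()

    def load_data(self):
        self.data = pd.read_csv(self.source)
        return self  # returning self enables method chaining

    def clean_data(self):
        # Drop every row that contains at least one missing value
        self.data = self.data.dropna()
        return self

    def transform_data(self, transformations):
        # Each transformation takes a DataFrame and returns a DataFrame
        for transformation in transformations:
            self.data = transformation(self.data)
        return self

    def analyze_data(self):
        return self.data.describe()

# Usage
pipeline = DataPipeline("data.csv")
report = (pipeline.load_data()
                 .clean_data()
                 # Example transformation: double the first column but keep the whole DataFrame
                 .transform_data([lambda df: df.assign(**{df.columns[0]: df[df.columns[0]] * 2})])
                 .analyze_data())
print(report)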
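
The usage above depends on a `data.csv` file being present. As a minimal sketch for trying the pipeline without one, the block below feeds it an in-memory CSV through `io.StringIO`. The sample rows, the column names `price` and `quantity`, and the `double_price` helper are hypothetical and exist only for illustration. The class is a condensed copy so the block runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data.csv; one row has a missing price.
CSV_TEXT = """price,quantity
10.0,1
20.0,2
,3
40.0,4
"""


class DataPipeline:
    """Condensed copy of the pipeline class, repeated so this block is self-contained."""

    def __init__(self, source):
        self.source = source
        self.data = None

    def load_data(self):
        self.data = pd.read_csv(self.source)
        return self

    def clean_data(self):
        self.data = self.data.dropna()
        return self

    def transform_data(self, transformations):
        for transformation in transformations:
            self.data = transformation(self.data)
        return self

    def analyze_data(self):
        return self.data.describe()


def double_price(df):
    """Return a new DataFrame with the 'price' column doubled (hypothetical helper)."""
    return df.assign(price=df["price"] * 2)


# pd.read_csv accepts file-like objects, so StringIO can stand in for a file path.
pipeline = DataPipeline(io.StringIO(CSV_TEXT))
report = (pipeline.load_data()
                  .clean_data()
                  .transform_data([double_price])
                  .analyze_data())
print(report)
```

Because `double_price` returns a full DataFrame rather than a single column, further transformations can be appended to the list and `describe()` still summarizes every numeric column.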

2. **Modeling Classes**

    Data scientists often develop and compare several statistical models or machine learning algorithms. Wrapping each model in a class with the same training, prediction, and evaluation methods gives them a common interface, so one can be swapped for another without changing the surrounding code.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

class Model:
    def __init__(self, algorithm):
        self.algorithm = algorithm  # an unfitted scikit-learn style estimator
        self.model = None           # set by train()

    def train(self, X_train, y_train):
        # scikit-learn's fit() returns the fitted estimator itself
        self.model = self.algorithm.fit(X_train, y_train)

    def predict(self, X_test):
        return self.model.predict(X_test)

    def evaluate(self, X_test, y_test):
        predictions = self.predict(X_test)
        return accuracy_score(y_test, predictions)

# Usage (assumes X_train, X_test, y_train, y_test already exist, e.g. from train_test_split)
model = Model(RandomForestClassifier())
model.train(X_train, y_train)
print("Accuracy:", model.evaluate(X_test, y_test))
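
To illustrate the "compare different models" part, here is a minimal sketch that runs two estimators through the same wrapper. The synthetic dataset from `make_classification`, the choice of `LogisticRegression` as the second candidate, and all hyperparameters are assumptions made for the example, not recommendations. The class is a condensed copy so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class Model:
    """Condensed copy of the wrapper class, repeated so this block is self-contained."""

    def __init__(self, algorithm):
        self.algorithm = algorithm
        self.model = None

    def train(self, X_train, y_train):
        self.model = self.algorithm.fit(X_train, y_train)

    def predict(self, X_test):
        return self.model.predict(X_test)

    def evaluate(self, X_test, y_test):
        return accuracy_score(y_test, self.predict(X_test))


# Synthetic data (an assumption for illustration; substitute real features and labels).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate estimators, keyed by a label used in the printed report.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    model = Model(estimator)
    model.train(X_train, y_train)
    scores[name] = model.evaluate(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Since both candidates go through the same `train` and `evaluate` calls, adding a third model only means adding one entry to `candidates`.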

3. **Feature Engineering**

    In many data projects, creating and managing features is a repetitive and complex task. A class can help encapsulate feature creation methods, enabling reusable and maintainable code.

In [None]:
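import numpy as np

class FeatureEngineer:
    def __init__(self, data):
        self.data = data  # note: the methods below add columns to this DataFrame directly

    def add_date_features(self, date_column):
        # Requires a datetime column; the .dt accessor fails on plain strings
        self.data[f"{date_column}_year"] = self.data[date_column].dt.year
        self.data[f"{date_column}_month"] = self.data[date_column].dt.month
        self.data[f"{date_column}_day"] = self.data[date_column].dt.day
        return self

    def log_transform(self, column):
        # log1p(x) computes log(1 + x), so zero values map to zero
        self.data[f"log_{column}"] = np.log1p(self.data[column])
        return self

# Usage (assumes df has a datetime 'purchase_date' column and a numeric 'price' column)
engineer = FeatureEngineer(df)
engineer.add_date_features('purchase_date').log_transform('price')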
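
The usage above assumes a `df` whose `purchase_date` column already has a datetime dtype. The sketch below builds a small hypothetical DataFrame (the values are made up for illustration) and converts the date strings with `pd.to_datetime` before engineering features. The class is a condensed copy so the block runs on its own.

```python
import numpy as np
import pandas as pd


class FeatureEngineer:
    """Condensed copy of the class, repeated so this block is self-contained."""

    def __init__(self, data):
        self.data = data

    def add_date_features(self, date_column):
        self.data[f"{date_column}_year"] = self.data[date_column].dt.year
        self.data[f"{date_column}_month"] = self.data[date_column].dt.month
        self.data[f"{date_column}_day"] = self.data[date_column].dt.day
        return self

    def log_transform(self, column):
        self.data[f"log_{column}"] = np.log1p(self.data[column])
        return self


# Hypothetical purchase records; the dates start out as plain strings.
df = pd.DataFrame({
    "purchase_date": ["2023-01-15", "2023-06-30", "2024-12-01"],
    "price": [0.0, 9.0, 99.0],
})

# The .dt accessor only works on datetime-like columns, so convert first.
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

engineer = FeatureEngineer(df)
engineer.add_date_features("purchase_date").log_transform("price")
print(engineer.data)
```

Note that `engineer.data` and `df` are the same object here, so the new columns also appear on `df`. Passing `df.copy()` to the constructor would leave the original untouched.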

4. **Simulation and Statistical Tests**

    For analysts conducting simulations or statistical tests, classes can be useful to manage the simulation parameters, methods, and results in a structured way.

In [None]:
import numpy as np

class Simulation:
    def __init__(self, params):
        self.params = params  # dict with keys 'n', 'p', and 'n_runs'
        self.results = []     # one binomial draw per run

    def run_simulation(self):
        for _ in range(self.params['n_runs']):
            # Number of successes in n trials with success probability p
            result = np.random.binomial(n=self.params['n'], p=self.params['p'])
            self.results.append(result)

    def summary_statistics(self):
        # Returns (mean, standard deviation) of the recorded results
        return np.mean(self.results), np.std(self.results)

# Usage
sim = Simulation({'n': 10, 'p': 0.5, 'n_runs': 100})
sim.run_simulation()
print("Mean and SD:", sim.summary_statistics())
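
The simulation above draws from NumPy's global random state, so each run gives different numbers. One possible variation, sketched below, stores a seeded generator on the instance to make results reproducible. `SeededSimulation` and its `seed` parameter are hypothetical names introduced for this example. It also draws all runs in a single vectorized call instead of a loop.

```python
import numpy as np


class SeededSimulation:
    """Hypothetical variant of Simulation that owns a seeded random generator."""

    def __init__(self, params, seed=None):
        self.params = params
        self.rng = np.random.default_rng(seed)  # independent of global random state
        self.results = None

    def run_simulation(self):
        # One vectorized call draws all n_runs binomial samples at once.
        self.results = self.rng.binomial(
            n=self.params["n"], p=self.params["p"], size=self.params["n_runs"]
        )
        return self

    def summary_statistics(self):
        return float(np.mean(self.results)), float(np.std(self.results))


params = {"n": 10, "p": 0.5, "n_runs": 1000}
first = SeededSimulation(params, seed=42).run_simulation()
second = SeededSimulation(params, seed=42).run_simulation()

mean, sd = first.summary_statistics()
print("Mean and SD:", mean, sd)
# Two instances created with the same seed produce identical draws.
print("Reproducible:", np.array_equal(first.results, second.results))
```

For these parameters the theoretical mean is n·p = 5 and the standard deviation is sqrt(n·p·(1−p)) ≈ 1.58, which gives a quick sanity check on the printed summary.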