# Single Responsibility Principle

Here is an example of a data science related class in Python that violates the Single Responsibility Principle (SRP). 

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

class DataLoader:
    def __init__(self, file_path):
        self.file_path = file_path

    def load_data(self):
        return pd.read_csv(self.file_path)

    def preprocess_data(self, data):
        # Perform data preprocessing tasks (e.g., handle missing values, encode categorical variables)
        return data.dropna().astype(float)

    def split_data(self, data):
        X = data.drop('target', axis=1)
        y = data['target']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        return X_train, X_test, y_train, y_test

    def train_model(self, X_train, y_train):
        model = LinearRegression()
        model.fit(X_train, y_train)
        return model

    def evaluate_model(self, model, X_test, y_test):
        y_pred = model.predict(X_test)
        return mean_squared_error(y_test, y_pred)

data_loader = DataLoader('..\\data\\sdg_electricity_data.csv')
data = data_loader.load_data()
# data = data_loader.preprocess_data(data)
# X_train, X_test, y_train, y_test = data_loader.split_data(data)
# model = data_loader.train_model(X_train, y_train)
# mse = data_loader.evaluate_model(model, X_test, y_test)
# print(f'Mean Squared Error: {mse:.2f}')


In this example, the DataLoader class has multiple responsibilities:
1. Loading data from a file
2. Preprocessing data
3. Splitting data into training and testing sets
4. Training a machine learning model
5. Evaluating the performance of the model

This violates the Single Responsibility Principle, which states that a class should have only one reason to change. In this case, the `DataLoader` class has multiple reasons to change, making it difficult to maintain and extend.

To fix this violation, we can break down the DataLoader class into separate classes, each with a single responsibility:
- DataLoader: responsible for loading data from a file
- DataPreprocessor: responsible for preprocessing data
- DataSplitter: responsible for splitting data into training and testing sets
- ModelTrainer: responsible for training a machine learning model
- ModelEvaluator: responsible for evaluating the performance of the model

By separating these responsibilities into individual classes, we can make the code more modular, maintainable, and scalable.

Fix the SRP problem by completing the 4 TODOs below, which break the original DataLoader class into separate classes with a single responsibility each.


In [None]:
# data_loader.py
import pandas as pd
class DataLoader:
    def __init__(self, file_path):
        self.file_path = file_path

    def load_data(self):
        return pd.read_csv(self.file_path)

# data_preprocessor.py
class DataPreprocessor:
    def preprocess_data(self, data):
        # TODO 1: Perform data preprocessing tasks (e.g., handle missing values, encode categorical variables)
        pass

# data_splitter.py
from sklearn.model_selection import train_test_split
class DataSplitter:
    def split_data(self, data):
        # TODO 2: Split the data into training and testing sets
        pass

# model_trainer.py
from sklearn.linear_model import LinearRegression
class ModelTrainer:
    def train_model(self, X_train, y_train):
        # TODO 3: Train a machine learning model
        pass

# model_evaluator.py
from sklearn.metrics import mean_squared_error
class ModelEvaluator:
    def evaluate_model(self, model, X_test, y_test):
        # TODO 4: Evaluate the model
        pass

# main.py
data_loader = DataLoader('..\\data\\sdg_electricity_data.csv')
data = data_loader.load_data()
data_preprocessor = DataPreprocessor()
data = data_preprocessor.preprocess_data(data)
data_splitter = DataSplitter()
X_train, X_test, y_train, y_test = data_splitter.split_data(data)
model_trainer = ModelTrainer()
model = model_trainer.train_model(X_train, y_train)
model_evaluator = ModelEvaluator()
mse = model_evaluator.evaluate_model(model, X_test, y_test)
print(f'Mean Squared Error: {mse:.2f}')
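
For comparison once you have attempted the TODOs, here is a minimal sketch of one possible completion. It reuses the method bodies of the original DataLoader class unchanged. Two things are assumptions made only for illustration: the small in-memory `toy` DataFrame that stands in for the CSV file, and the `target` parameter on `split_data` (the exercise hard-codes the column name `'target'`).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


class DataPreprocessor:
    def preprocess_data(self, data):
        # Same behaviour as the original class: drop incomplete rows, cast to float
        return data.dropna().astype(float)


class DataSplitter:
    def split_data(self, data, target="target"):
        # Separate features from the target column, then split 80/20
        X = data.drop(target, axis=1)
        y = data[target]
        return train_test_split(X, y, test_size=0.2, random_state=42)


class ModelTrainer:
    def train_model(self, X_train, y_train):
        model = LinearRegression()
        model.fit(X_train, y_train)
        return model


class ModelEvaluator:
    def evaluate_model(self, model, X_test, y_test):
        return mean_squared_error(y_test, model.predict(X_test))


# Hypothetical data standing in for the CSV file: target = 2 * x + 1,
# with one missing x value that the preprocessor removes.
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "target": [2.0 * i + 1.0 for i in range(10)],
})

clean = DataPreprocessor().preprocess_data(toy)
X_train, X_test, y_train, y_test = DataSplitter().split_data(clean)
model = ModelTrainer().train_model(X_train, y_train)
mse = ModelEvaluator().evaluate_model(model, X_test, y_test)
print(f"Mean Squared Error: {mse:.2f}")
```

Each class can now change for exactly one reason. Swapping LinearRegression for another estimator, for instance, touches only ModelTrainer.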


# Open-Closed Principle

Here is an example of a data science related class in Python that violates the Open-Closed Principle (OCP). 

In [None]:
class DataAnalyzer:
    def __init__(self, data):
        self.data = data

    def analyze_data(self):
        if self.data.shape[0] < 100:
            return self._analyze_small_data()
        elif self.data.shape[0] < 1000:
            return self._analyze_medium_data()
        else:
            return self._analyze_large_data()

    def _analyze_small_data(self):
        # Analyze small data using a simple method
        return "Small data analysis result"

    def _analyze_medium_data(self):
        # Analyze medium data using a moderately complex method
        return "Medium data analysis result"

    def _analyze_large_data(self):
        # Analyze large data using a complex method
        return "Large data analysis result"

# some_data is assumed to be an existing DataFrame (or any object with a .shape attribute)
data_analyzer = DataAnalyzer(some_data)
result = data_analyzer.analyze_data()
print(result)


In this example, the DataAnalyzer class has a single method analyze_data that analyzes the data based on its size. The method uses an if-elif-else statement to determine which analysis method to use, depending on the size of the data. 

This design violates the Open-Closed Principle, which says that a class should be open for extension but closed for modification. DataAnalyzer cannot be extended without being modified: if we want to add a new analysis method for a different data size, we have to change the existing analyze_data method.

To fix this violation, we can use polymorphism and inheritance to create a more flexible and extensible design. For example, we can create an abstract base class DataAnalyzer with an abstract method analyze_data, and then create concrete subclasses for each data size range that implement the analyze_data method. This way, we can add new analysis methods without modifying the existing code.

Fix this violation in the cell below. The abstract base class DataAnalyzer with its abstract method analyze_data is already provided, so complete the TODO by creating the concrete subclasses for each data size range that implement the analyze_data method.

The create_data_analyzer function is used to create an instance of the appropriate subclass based on the size of the data.

In [None]:
from abc import ABC, abstractmethod

class DataAnalyzer(ABC):
    def __init__(self, data):
        self.data = data

    @abstractmethod
    def analyze_data(self):
        pass

# TODO: Create the three different classes for small, medium, and large data analysis

def create_data_analyzer(data):
    if data.shape[0] < 100:
        return SmallDataAnalyzer(data)
    elif data.shape[0] < 1000:
        return MediumDataAnalyzer(data)
    else:
        return LargeDataAnalyzer(data)

data_analyzer = create_data_analyzer(some_data)
result = data_analyzer.analyze_data()
print(result)


This design follows the Open-Closed Principle more closely because a new analysis method for a different data size range is added by creating a new subclass that implements the analyze_data method, without modifying any existing analyzer class. The only existing code that changes is the create_data_analyzer function, which has to know when to instantiate the new subclass.
This design is more flexible, extensible, and maintainable than the original code.
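
A minimal sketch of one way to complete the TODO is shown here. The subclass names come from the create_data_analyzer function above, and the return strings are reused from the original class. The `some_data` DataFrame is a made-up sample, assumed only so that the example runs on its own.

```python
from abc import ABC, abstractmethod

import pandas as pd


class DataAnalyzer(ABC):
    def __init__(self, data):
        self.data = data

    @abstractmethod
    def analyze_data(self):
        """Return the analysis result for self.data."""


class SmallDataAnalyzer(DataAnalyzer):
    def analyze_data(self):
        # Analyze small data using a simple method
        return "Small data analysis result"


class MediumDataAnalyzer(DataAnalyzer):
    def analyze_data(self):
        # Analyze medium data using a moderately complex method
        return "Medium data analysis result"


class LargeDataAnalyzer(DataAnalyzer):
    def analyze_data(self):
        # Analyze large data using a complex method
        return "Large data analysis result"


def create_data_analyzer(data):
    # Pick the subclass from the number of rows, as in the exercise
    if data.shape[0] < 100:
        return SmallDataAnalyzer(data)
    elif data.shape[0] < 1000:
        return MediumDataAnalyzer(data)
    else:
        return LargeDataAnalyzer(data)


# Hypothetical sample with 250 rows, so the medium analyzer is selected
some_data = pd.DataFrame({"value": range(250)})
result = create_data_analyzer(some_data).analyze_data()
print(result)  # Medium data analysis result
```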

# Liskov Substitution Principle

Here is another example of a data science related class in Python that violates the Liskov Substitution Principle (LSP). 

In [None]:
import json
import pandas as pd

class DataLoader:
    def load_data(self, file_path):
        return pd.read_csv(file_path)

class CSVDataLoader(DataLoader):
    def load_data(self, file_path):
        return pd.read_csv(file_path)

class JSONDataLoader(DataLoader):
    def load_data(self, file_path):
        with open(file_path, 'r') as f:
            data = json.load(f)
        return pd.DataFrame(data)

data_loader = JSONDataLoader()
data = data_loader.load_data('data.json')
print(data)


In this example, the JSONDataLoader class inherits from the DataLoader class, but it does not respect the contract that the DataLoader class is meant to define. The load_data method in DataLoader is expected to return a pd.DataFrame object or raise a ValueError if the file path is invalid, whereas the load_data method in JSONDataLoader returns a pd.DataFrame object without validating the file path at all, so an invalid path fails in a different way.
This violates the Liskov Substitution Principle because we cannot use a JSONDataLoader object in place of a DataLoader object without changing the behavior of the program. The JSONDataLoader class is not a true substitute for the DataLoader class.

To fix this violation, we can modify the JSONDataLoader class to handle the file path in a way that is consistent with the contract defined by the DataLoader class. For example, we could modify the load_data method to raise a ValueError if the file path is invalid, or to return a default value if the file is empty.

You need to fix this violation by modifying the JSONDataLoader class to handle the file path in a way that is consistent with the contract defined by the DataLoader class. Specifically, add a check that the file path ends with .json, and raise a ValueError if it does not.

In [None]:
import json
import pandas as pd

class DataLoader:
    def load_data(self, file_path):
        if not file_path.endswith('.csv'):
            raise ValueError("Invalid file path. Only CSV files are supported.")
        return pd.read_csv(file_path)

class CSVDataLoader(DataLoader):
    def load_data(self, file_path):
        # TODO 1: Implement loading data from a CSV file in a way that's consistent with the contract defined by the base class
        pass

class JSONDataLoader(DataLoader):
    def load_data(self, file_path):
        # TODO 2: Implement loading data from a JSON file in a way that's consistent with the contract defined by the base class
        pass

data_loader = JSONDataLoader()
data = data_loader.load_data('data.json')
print(data)

Once the TODOs are completed, the JSONDataLoader class handles the file path in a way that is consistent with the contract defined by the DataLoader class, which fixes the Liskov Substitution Principle violation. We can now use the JSONDataLoader class in place of the DataLoader class without changing the behavior of the program.

Note that we have also added a similar check to the DataLoader class to ensure that it only supports CSV files. This ensures that the DataLoader class and its subclasses have a consistent contract. 
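
One possible way to fill in the two TODOs is sketched here, assuming the contract is "return a DataFrame, or raise a ValueError for an unsupported file path". The `records` list and the temporary file are hypothetical and exist only to make the example runnable on its own.

```python
import json
import os
import tempfile

import pandas as pd


class DataLoader:
    def load_data(self, file_path):
        if not file_path.endswith('.csv'):
            raise ValueError("Invalid file path. Only CSV files are supported.")
        return pd.read_csv(file_path)


class CSVDataLoader(DataLoader):
    def load_data(self, file_path):
        # Reuse the base-class check and loading logic unchanged
        return super().load_data(file_path)


class JSONDataLoader(DataLoader):
    def load_data(self, file_path):
        # Same kind of failure as the base class: ValueError for a bad path
        if not file_path.endswith('.json'):
            raise ValueError("Invalid file path. Only JSON files are supported.")
        with open(file_path, 'r') as f:
            data = json.load(f)
        return pd.DataFrame(data)


# Hypothetical records written to a temporary JSON file
records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    with open(path, "w") as f:
        json.dump(records, f)
    frame = JSONDataLoader().load_data(path)

print(frame.shape)  # (2, 2)

try:
    JSONDataLoader().load_data("data.txt")
except ValueError as err:
    print(err)  # Invalid file path. Only JSON files are supported.
```

Callers that catch ValueError around load_data now behave the same way regardless of which loader they are given.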

# Interface Segregation Principle

Here's an example of a class that violates the Interface Segregation Principle (ISP):

In [None]:
class Model:
    def __init__(self, data):
        self.data = data

    def train_linear_regression(self):
        # implementation to train a linear regression model
        pass

    def train_logistic_regression(self):
        # implementation to train a logistic regression model
        pass

    def train_decision_tree(self):
        # implementation to train a decision tree model
        pass

    def train_random_forest(self):
        # implementation to train a random forest model
        pass

    def train_neural_network(self):
        # implementation to train a neural network model
        pass

    def evaluate_model(self):
        # implementation to evaluate the model
        pass

    def make_prediction(self):
        # implementation to make a prediction using the model
        pass

# Usage:
model = Model(data)
model.train_linear_regression()
model.evaluate_model()
model.make_prediction()

In this example, the Model class has a large interface that includes multiple methods for training different types of machine learning models (linear regression, logistic regression, decision tree, random forest, neural network), evaluating the model, and making predictions.

The problem with this class is that it forces clients to depend on a large interface, even if they only need to use a subset of the methods. For example, a client that only needs to train a linear regression model and make predictions is forced to depend on the entire interface, including the methods for training other types of models and evaluating the model.

This violates the Interface Segregation Principle, which states that "clients should not be forced to depend on interfaces they don't use". A better design would be to break down the interface into smaller, more focused interfaces that cater to specific use cases.

Splitting the interface this way reduces the coupling between clients and the Model class, and makes the system more modular and flexible. Fix the ISP violation by creating separate interfaces for LinearRegressionModel, LogisticRegressionModel and ModelEvaluator, based on the usage pattern shared below.

In [None]:
# TODO: Fix the ISP violation by creating separate abstractions for LinearRegressionModel, LogisticRegressionModel and ModelEvaluator, based on the usage pattern shared below

# Usage:
linear_regression_model = LinearRegressionModel(data)
linear_regression_model.train()
linear_regression_model.make_prediction()

logistic_regression_model = LogisticRegressionModel(data)
logistic_regression_model.train()
logistic_regression_model.make_prediction()

model_evaluator = ModelEvaluator(linear_regression_model)
model_evaluator.evaluate()

In this updated design, we've broken down the interface into smaller, more focused interfaces that cater to specific use cases. Each interface has a smaller set of methods that are relevant to that specific use case. This makes it easier for clients to depend on only the interfaces they need, without being forced to depend on a large interface.
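
The following is a minimal sketch of one way the TODO could be approached, not the only valid answer. The abstract base class `TrainableModel` is a hypothetical name introduced here. The sketch also assumes that `data` is a DataFrame with a `"target"` column and that scikit-learn estimators do the actual training. The sample DataFrames are made up.

```python
from abc import ABC, abstractmethod

import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error


class TrainableModel(ABC):
    """Hypothetical narrow interface: only train and make_prediction."""

    def __init__(self, data, target="target"):
        # Assumption: data is a DataFrame whose label column is `target`
        self.X = data.drop(target, axis=1)
        self.y = data[target]

    @abstractmethod
    def train(self):
        ...

    @abstractmethod
    def make_prediction(self):
        ...


class LinearRegressionModel(TrainableModel):
    def train(self):
        self.model = LinearRegression().fit(self.X, self.y)

    def make_prediction(self):
        return self.model.predict(self.X)


class LogisticRegressionModel(TrainableModel):
    def train(self):
        self.model = LogisticRegression().fit(self.X, self.y)

    def make_prediction(self):
        return self.model.predict(self.X)


class ModelEvaluator:
    """Evaluation lives in its own class, so models do not have to carry it."""

    def __init__(self, model):
        self.model = model

    def evaluate(self):
        # Mean squared error of the model's predictions on its own data
        return mean_squared_error(self.model.y, self.model.make_prediction())


# Usage with hypothetical sample data (target = 2 * x + 1)
data = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "target": [1.0, 3.0, 5.0, 7.0]})
linear_regression_model = LinearRegressionModel(data)
linear_regression_model.train()
mse = ModelEvaluator(linear_regression_model).evaluate()

# Hypothetical binary labels for the classifier
labels = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "target": [0, 0, 1, 1]})
logistic_regression_model = LogisticRegressionModel(labels)
logistic_regression_model.train()
predictions = logistic_regression_model.make_prediction()
```

A client that only trains a linear regression model depends on LinearRegressionModel alone and never sees the methods for other model types or for evaluation.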

# Dependency Inversion Principle

Here's an example of a class that violates the Dependency Inversion Principle (DIP):

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

class DataVisualizer:
    def __init__(self, data):
        self.data = data

    def visualize(self):
        plt.plot(self.data)
        plt.xlabel('X Axis')
        plt.ylabel('Y Axis')
        plt.title('Data Visualization')
        plt.show()

# Create a sample dataset
data = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 6, 8, 10]})

# Create an instance of the DataVisualizer class
visualizer = DataVisualizer(data['Y'])  # Pass in the 'Y' column of the dataset

# Call the visualize method to create the plot
visualizer.visualize()

In this example, the DataVisualizer class is tightly coupled to the matplotlib library for visualizing data. This violates the Dependency Inversion Principle, which states that:
- High-level modules should not depend on low-level modules. Instead, both should depend on abstractions.
- Abstractions should not depend on details. Details should depend on abstractions.

In this case, the DataVisualizer class is a high-level module that depends on a low-level module (matplotlib) for a specific implementation of data visualization. This makes it difficult to change or replace the visualization library without affecting the DataVisualizer class.

A better design would be to invert the dependencies by introducing an abstraction for data visualization. To fix the DIP violation, implement the MatplotlibVisualizer and DataAnalyzer classes below based on the usage pattern at the end of the cell.

In [None]:
from abc import ABC, abstractmethod

class DataVisualizer(ABC):
    @abstractmethod
    def visualize(self, data):
        pass

class MatplotlibVisualizer(DataVisualizer):
    # TODO 1: Implement the visualize method using Matplotlib to create a plot
    pass

class DataAnalyzer:
    # TODO 2: Add a visualizer attribute to the DataAnalyzer class (set in __init__) and an analyze method that passes the data to it
    pass

# Create an instance of MatplotlibVisualizer
visualizer = MatplotlibVisualizer()

# Create an instance of DataAnalyzer, passing in the visualizer
analyzer = DataAnalyzer(visualizer)

# Generate some sample data
data = [1, 2, 3, 4, 5]

# Call the analyze method on the analyzer, passing in the data
analyzer.analyze(data)

In this revised design, we've introduced an abstraction for data visualization (DataVisualizer) and a concrete implementation for it (MatplotlibVisualizer). The DataAnalyzer class depends on the abstraction, rather than the specific implementation. This makes it easier to change or replace the visualization library without affecting the DataAnalyzer class.
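
To make the benefit concrete, here is a minimal sketch of one possible completion. The matplotlib calls are the same ones used in the original DataVisualizer. `ConsoleVisualizer` is a hypothetical second implementation added only to show that the visualizer can be swapped without touching DataAnalyzer. Treating "analysis" as simply delegating to the visualizer is also an assumption.

```python
from abc import ABC, abstractmethod


class DataVisualizer(ABC):
    @abstractmethod
    def visualize(self, data):
        ...


class MatplotlibVisualizer(DataVisualizer):
    def visualize(self, data):
        # Imported lazily so that only this class needs matplotlib installed
        import matplotlib.pyplot as plt

        plt.plot(data)
        plt.xlabel('X Axis')
        plt.ylabel('Y Axis')
        plt.title('Data Visualization')
        plt.show()


class ConsoleVisualizer(DataVisualizer):
    """Hypothetical alternative: prints one text bar per value."""

    def visualize(self, data):
        for value in data:
            print('#' * int(value))


class DataAnalyzer:
    def __init__(self, visualizer: DataVisualizer):
        # Depends only on the abstraction, never on a concrete library
        self.visualizer = visualizer

    def analyze(self, data):
        # Assumption: analysis here just delegates to the visualizer
        self.visualizer.visualize(data)


# Swapping the implementation requires no change to DataAnalyzer
analyzer = DataAnalyzer(ConsoleVisualizer())
analyzer.analyze([1, 2, 3])
```

Passing `MatplotlibVisualizer()` instead produces the line plot from the original example.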