# 1. Custom classes

This section describes my implementation of two of the three main classes that drive the execution of the program. To understand them fully, note the following:

- They were adapted to work as steps of a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html): in particular, they must **implement** the methods `fit` and `transform` such that:
  - `fit` and `transform` have the same signature: they accept the features' dataframe (and the labels, to follow `sklearn` conventions, although they are not used)
  - `fit` returns the instance itself, while `transform` returns the transformation output (it can be a `pandas.DataFrame`, `numpy.ndarray` or `scipy.sparse.csr.csr_matrix`)

- The pipeline contains the following transformations prior to the model fitting:
  1. Questions' text preprocessing, through the `TextPreprocessor` class
  2. Feature (and extra features) generation:
    - 2.1. The feature vectors are computed for each question with the `FeatureGenerator` class, and are then aggregated into a single feature vector (subtracted, compared, stacked...).
    - 2.2. The extra features are added as columns to the former feature matrix with the `ExtraFeaturesCreator` class (initialized and called from `FeatureGenerator`)

- Several methods are available for each part of the pipeline and they are listed/stored in private global variables (at module level). These are:
  - Feature vector extraction: stored at `_SUPPORTED_EXTRACTORS`, these are:
    - `CountVectorizer` (labeled as 'cv' and 'cv_2w' depending on `ngram_range`)
    - `TF-IDF` (labeled as 'tf_idf' and 'tf_idf_2w' depending on `ngram_range`)
    - `spacy` embedding (labeled as 'spacy_small' and 'spacy_medium', but they are DISMISSED due to their computational cost)
  - Feature vector aggregation: stored at `_SUPPORTED_AGGREGATORS`, these are:
    - `stack`: horizontally stacking the feature vectors
    - `absolute`: computing the absolute difference between feature vectors
    - `cosine`: computing the cosine similarity between feature vectors (DISMISSED because the outputs were dense, not sparse, and did not fit in memory no matter what we tried)
  - Extra features creation: stored at `_SUPPORTED_EXTRA_FEATURES`:
    - Several functions are available: each of them receives 2 lists of words (`str`), one per question (after preprocessing), and must return a `float`. Some examples, reviewed below, are `_length_ratio` and `_get_coincident_words_ratio`
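
The `fit`/`transform` contract listed above can be illustrated with a minimal sketch. The `QuestionLengths` transformer and the toy data are hypothetical and are not part of the project; inheriting from `BaseEstimator` and `TransformerMixin` is also an assumption of this sketch (the classes shown in this section implement the methods directly).

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class QuestionLengths(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: word count of each question, one column each."""

    def fit(self, questions_df: pd.DataFrame, y=None):
        # Nothing to learn; `y` is accepted only to follow sklearn conventions.
        return self

    def transform(self, questions_df: pd.DataFrame, y=None) -> np.ndarray:
        # Split every question into words and count them, column by column.
        counts = questions_df.apply(lambda col: col.str.split().str.len())
        return counts.to_numpy()


# Toy data, invented for this example.
questions_df = pd.DataFrame({
    'question1': ['how do i learn python', 'what is the capital of france',
                  'is rust fast', 'why is the sky blue'],
    'question2': ['how can i learn python fast', 'where is paris located',
                  'is rust a fast language', 'how old is the universe'],
})
labels = np.array([1, 0, 1, 0])

pipeline = Pipeline([
    ('lengths', QuestionLengths()),   # custom step: fit + transform
    ('model', LogisticRegression()),  # final estimator
])
pipeline.fit(questions_df, labels)
predictions = pipeline.predict(questions_df)
```

Because `fit` returns the instance, the pipeline can chain `fit` and `transform` on every intermediate step before fitting the final model.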

## 1.1. `FeatureGenerator` 

```python
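import os
from typing import Dict, Tuple, Union

import numpy as np
import pandas as pd
# assumed: the sparse `hstack`, since the extractors return sparse matrices
from scipy.sparse import hstack


class FeatureGenerator:
    def __init__(self,
                 exts: Tuple[str, ...] = ('cv', ),
                 aggs: Tuple[str, ...] = ('stack', ),
                 extra_features: Union[Tuple[str, ...], str] = 'all') -> None:
        assert len(exts) == len(aggs), \
            "Extractor and aggregator lists must be of the same length"
        self.extractors = [_get_extractor(ext) for ext in exts]
        self.extractor_names: Tuple[str, ...] = exts
        self.aggregators = [_SUPPORTED_AGGREGATORS[agg] for agg in aggs]
        self.extra_features_creator = ExtraFeaturesCreator(extra_features)

    def set_params(self,
                   exts: Tuple[str, ...] = ('cv', ),
                   aggs: Tuple[str, ...] = ('stack', ),
                   extra_features: Union[Tuple[str, ...], str] = 'all'
                   ) -> 'FeatureGenerator':
        # re-initialize this instance in place (building a new object with
        # `self.__class__(...)` would discard the new parameters)
        self.__init__(exts, aggs, extra_features)
        return self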

    def fit(self, questions_df: pd.DataFrame, y=None):
        self.extractors = [ext if name.startswith('spacy') else ext.fit(
            questions_df.values.flatten()) for name, ext in zip(
            self.extractor_names, self.extractors)]
        return self

    def transform(self, questions_df: pd.DataFrame, y=None):
        agg_features = []
        for name, ext, agg in zip(self.extractor_names, self.extractors,
                                  self.aggregators):
            # we apply the extractor to each question
            if name.startswith('spacy'):
                print("Using spacy word embedding: please WAIT, "
                      "this may take some time")
                # then we use a spacy embedding
                x_q1 = questions_df.iloc[:, 0].apply(
                    lambda x: ext(x).vector)
                x_q2 = questions_df.iloc[:, 1].apply(
                    lambda x: ext(x).vector)

            else:
                x_q1 = ext.transform(questions_df.iloc[:, 0])
                x_q2 = ext.transform(questions_df.iloc[:, 1])

            # and we aggregate them
            x_agg = agg(x_q1, x_q2)
            agg_features.append(x_agg)

        if len(self.extra_features_creator.features_functions) != 0:
            # in parallel, we compute the extra features
            x_extra: np.ndarray = self.extra_features_creator.transform(
                questions_df)
            # finally, we merge them
            return hstack((hstack(agg_features), x_extra))
        return hstack(agg_features)


def _get_extractor(ext: str):
    if ext in ('spacy_small', 'spacy_medium'):
        # imported lazily: spacy is only needed for the spacy extractors
        import spacy
        _spacy_version: str = _SUPPORTED_EXTRACTORS[ext]
        try:
            return spacy.load(_spacy_version)
        except OSError:
            # the model is not installed yet: download it and retry
            os.system(f'python -m spacy download {_spacy_version}')
            return spacy.load(_spacy_version)
    return _SUPPORTED_EXTRACTORS[ext]
```

In summary, the `transform` method does the main work of this class. It:
- Extracts the feature vectors (given the dataframe of the preprocessed questions' text) for each sample
- Aggregates them into a single feature vector per sample (forming a matrix with one row per sample)
- Finally, appends the extra features (created with `ExtraFeaturesCreator`) to the aggregated feature matrix

The `fit` method only fits the feature vector 'extractors' (i.e. TF-IDF and CV) on the whole text (both questions, all samples); the `spacy` extractors are left untouched. Finally, the `__init__` method lets the user select which technique is used for each of the former tasks.

It is worth mentioning the utility that Claudia, [@claudia-hm](https://github.com/claudia-hm), implemented for this class:
- **More than one kind of extraction and aggregation can be used**. Namely, the final feature vector can be the stacking of 2 feature vectors: 1 coming from one kind of extraction/aggregation (e.g. 'cv' & 'stack') and 1 coming from another kind (e.g. 'tf_idf' & 'absolute'). This notably raises the achieved performance.
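
The combination of several extraction/aggregation pairs can be sketched outside the class as follows. This is a simplified illustration under stated assumptions, not the project's code: the toy questions are invented, and the `stack` and `absolute` functions below are hypothetical stand-ins for the entries of `_SUPPORTED_AGGREGATORS`.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy (already preprocessed) questions, invented for this example.
q1 = ["how do i learn python", "what is the capital of france"]
q2 = ["how can i learn python fast", "where is paris located"]


def stack(x_q1, x_q2):
    """Hypothetical 'stack' aggregator: place both feature matrices side by side."""
    return hstack((x_q1, x_q2))


def absolute(x_q1, x_q2):
    """Hypothetical 'absolute' aggregator: element-wise absolute difference."""
    return abs(x_q1 - x_q2)


# One (extractor, aggregator) pair per position, as in `exts` and `aggs`.
pairs = [(CountVectorizer(), stack), (TfidfVectorizer(), absolute)]

blocks = []
for extractor, aggregate in pairs:
    extractor.fit(q1 + q2)  # fit on the text of both questions
    blocks.append(aggregate(extractor.transform(q1), extractor.transform(q2)))

# Final feature matrix: the blocks of every pair, stacked horizontally.
features = hstack(blocks).tocsr()
vocab_size = len(pairs[0][0].vocabulary_)
```

With a vocabulary of `V` words, the 'cv' & 'stack' pair contributes `2 * V` columns and the 'tf_idf' & 'absolute' pair contributes `V`, so each sample ends up with `3 * V` features.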

## 1.2. `ExtraFeaturesCreator`

```python
class ExtraFeaturesCreator:
    def __init__(self, features_to_add: Union[Tuple, str]) -> None:
        if isinstance(features_to_add, str):
            assert features_to_add == 'all', "Unrecognized extra features list"
            self.features_functions: Dict[str, callable] = \
                _SUPPORTED_EXTRA_FEATURES
        else:
            self.features_functions = {
                _n: _SUPPORTED_EXTRA_FEATURES[_n] for _n in features_to_add}

    def transform(self, questions_df: pd.DataFrame) -> np.ndarray:
        if len(self.features_functions) == 0:
            raise ValueError("There is no extra features to be aggregated")

        # work on a copy so that the caller's dataframe is not modified
        questions_df = questions_df.copy()
        for _c in ('question1', 'question2'):
            questions_df[_c] = questions_df[_c].str.split()
        extra_features = pd.DataFrame()

        for _f_name, _f_function in self.features_functions.items():
            extra_features[_f_name] = questions_df.apply(
                lambda x: _f_function(x.question1, x.question2), axis=1)

        return extra_features.values
```

Since it is called from `FeatureGenerator`, and not from `sklearn.pipeline.Pipeline`, this class only needs to implement a `transform` method (no `fit`). The method receives a dataframe containing the text of the 2 questions and, for each selected extra-feature function, computes one column of extra features. The columns are finally returned together as a `np.ndarray`.
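
The "one column per function" mechanism can be sketched as follows. The two feature functions and the toy dataframe are hypothetical and only follow the contract described earlier (2 lists of words in, 1 `float` out); they are not the functions stored in `_SUPPORTED_EXTRA_FEATURES`.

```python
from typing import Callable, Dict, List

import numpy as np
import pandas as pd


def word_count_difference(q1_w: List[str], q2_w: List[str]) -> float:
    """Hypothetical extra feature: absolute difference in number of words."""
    return float(abs(len(q1_w) - len(q2_w)))


def same_first_word(q1_w: List[str], q2_w: List[str]) -> float:
    """Hypothetical extra feature: 1.0 if both questions start with the same word."""
    return float(bool(q1_w) and bool(q2_w) and q1_w[0] == q2_w[0])


# Registry of name -> function, mirroring the role of `_SUPPORTED_EXTRA_FEATURES`.
feature_functions: Dict[str, Callable[[List[str], List[str]], float]] = {
    'word_count_difference': word_count_difference,
    'same_first_word': same_first_word,
}

questions_df = pd.DataFrame({
    'question1': ['how do i learn python', 'what is the capital of france'],
    'question2': ['how can i learn python fast', 'where is paris located'],
})

# Split each question into its list of words (on a new dataframe).
words_df = questions_df.apply(lambda col: col.str.split())

# One column per feature function, one row per sample.
columns = [
    [function(q1_w, q2_w)
     for q1_w, q2_w in zip(words_df['question1'], words_df['question2'])]
    for function in feature_functions.values()
]
extra_features = np.column_stack(columns)
```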

# 2. Custom functions

The functions I implemented are presented next. More precisely, their docstrings, which contain my explanations, are shown:

```python
from utils import remove_nan_questions, _horizontal_stacking, _length_ratio, _get_coincident_words_ratio
```

```python
help(remove_nan_questions)
```

```
Help on function remove_nan_questions in module utils:

remove_nan_questions(x_train: pandas.core.frame.DataFrame, y_train: pandas.core.frame.DataFrame) -> Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]
    Remove those samples which contain a NaN in at least one question.

    Parameters
    ----------
    x_train: pd.DataFrame
        Two columns dataframe containing the feature questions
    y_train: pd.DataFrame
        One column dataframe containing the labels

    Returns
    -------
    dropped_x_train, dropped_y_train: Tuple[pd.DataFrame, pd.DataFrame]
        Dataframes without NaN in any sample
```
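
A behaviour consistent with this docstring could look like the sketch below. It is only an illustration: `remove_nan_questions_sketch` and the toy data are hypothetical, the actual body of `remove_nan_questions` is not shown in this document, and the sketch assumes that both dataframes share the same index.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def remove_nan_questions_sketch(
        x_train: pd.DataFrame,
        y_train: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical re-implementation: keep rows where both questions exist."""
    # True for the samples in which no question is NaN.
    keep = x_train.notna().all(axis=1)
    # The same boolean mask filters features and labels (index-aligned).
    return x_train[keep], y_train[keep]


x_train = pd.DataFrame({
    'question1': ['how do i learn python', np.nan, 'is rust fast'],
    'question2': ['how can i learn python fast', 'where is paris', 'is go fast'],
})
y_train = pd.DataFrame({'is_duplicate': [1, 0, 0]})

dropped_x_train, dropped_y_train = remove_nan_questions_sketch(x_train, y_train)
```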



```python
help(_horizontal_stacking)
```

```
Help on function _horizontal_stacking in module utils:

_horizontal_stacking(x_q1: scipy.sparse.csr.csr_matrix, x_q2: scipy.sparse.csr.csr_matrix) -> scipy.sparse.csr.csr_matrix
    Stack horizontally the 2 passed feature matrices

    Parameters
    ----------
    x_q1: csr_matrix
        Feature (sparse) matrix (each row is the feature vector
        obtained from the first question)
    x_q2: csr_matrix
        Feature (sparse) matrix of the second question

    Returns
    -------
    Feature (sparse) matrix with the questions merged
```
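
As a small worked example of what horizontal stacking does to two sparse matrices, consider the sketch below. `horizontal_stacking_sketch` is a hypothetical stand-in (the body of `_horizontal_stacking` is not shown in this document), and the input values are made up.

```python
from scipy.sparse import csr_matrix, hstack, issparse

# Two samples, two features per question (made-up values).
x_q1 = csr_matrix([[1, 0], [0, 2]])
x_q2 = csr_matrix([[0, 3], [4, 0]])


def horizontal_stacking_sketch(x_q1: csr_matrix, x_q2: csr_matrix) -> csr_matrix:
    """Hypothetical stand-in: put the features of both questions side by side."""
    # `format='csr'` keeps the result in the same sparse format as the inputs.
    return hstack((x_q1, x_q2), format='csr')


stacked = horizontal_stacking_sketch(x_q1, x_q2)
print(stacked.toarray())
# [[1 0 0 3]
#  [0 2 4 0]]
```

Each row keeps the features of the first question followed by those of the second, and the result remains sparse.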



```python
help(_length_ratio)
```

```
Help on function _length_ratio in module utils:

_length_ratio(q1_w: List[str], q2_w: List[str]) -> float
    Return the questions' length ratio (the first with respect to the second).
    If any of them has 0 length (after preprocessing), the retrieved ratio
    is 0.

    Parameters
    ----------
    q1_w: List[str]
        List of words contained in the first (preprocessed) question's samples
    q2_w: List[str]
        List of words contained in the second (preprocessed) question's samples

    Returns
    -------
    ratio: float
        Questions' length ratio
```



```python
help(_get_coincident_words_ratio)
```

```
Help on function _get_coincident_words_ratio in module utils:

_get_coincident_words_ratio(q1_w: List[str], q2_w: List[str]) -> float
    Count the ratio of coincident words with respect to the total number of them.
    This is applied at a sample level.

    Parameters
    ----------
    q1_w: List[str]
        List of words contained in the first (preprocessed) question's samples
    q2_w: List[str]
        List of words contained in the second (preprocessed) question's samples

    Returns
    -------
    ratio: float
        Ratio of coincident words (between the 2 questions)
```
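
The two extra-feature docstrings above can be read as the following sketch. Both functions are hypothetical re-implementations, since the actual bodies are not shown in this document. In particular, "the total number of them" is interpreted here as the number of distinct words appearing in either question, which is an assumption.

```python
from typing import List


def length_ratio_sketch(q1_w: List[str], q2_w: List[str]) -> float:
    """Length of the first question relative to the second (0 if any is empty)."""
    if len(q1_w) == 0 or len(q2_w) == 0:
        return 0.0
    return len(q1_w) / len(q2_w)


def coincident_words_ratio_sketch(q1_w: List[str], q2_w: List[str]) -> float:
    """Shared distinct words divided by all distinct words (assumed reading)."""
    words_1, words_2 = set(q1_w), set(q2_w)
    all_words = words_1 | words_2
    if not all_words:
        # Both questions are empty after preprocessing.
        return 0.0
    return len(words_1 & words_2) / len(all_words)


q1_words = ['how', 'learn', 'python']
q2_words = ['learn', 'python', 'fast']

length_ratio = length_ratio_sketch(q1_words, q2_words)               # 3 / 3
coincident_ratio = coincident_words_ratio_sketch(q1_words, q2_words)  # 2 / 4
```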

