In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

## Load the dataset into `pandas`
The file is in this directory (`SMSSpamCollection`, with no file extension). It is a dataset of text messages, each labeled either spam or not spam (the latter is also known as ham).

Original source: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

1. You'll need to create your own column names: `label` and `message`
2. The data is tab separated, not comma separated

In [2]:
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Preprocessing
For the label column, convert `'ham'` to 0 and `'spam'` to 1.

In [3]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


## Baseline accuracy
What is the baseline accuracy of this dataset?

In [5]:
1 - df['label'].mean()

0.8659368269921034
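
The expression above works because `label` is coded 0/1, so its mean is the spam fraction and one minus that is the ham fraction. A more general way to get the majority-class baseline is to take the largest normalized value count. The sketch below shows this on a small made-up series (`toy_labels` is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy labels (0 = ham, 1 = spam); not the real dataset
toy_labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Share of each class; the largest share is the accuracy obtained by
# always predicting the majority class
class_shares = toy_labels.value_counts(normalize=True)
baseline = class_shares.max()

print(baseline)  # 0.75
```

Because ham is the majority class here, this agrees with `1 - toy_labels.mean()`.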

## Feature extraction

Using `sklearn`'s `FunctionTransformer` class, create several functions using the `str.contains()` method. One has been set up as an example for you below:

In [69]:
def has_money_symbol(df):
    return df['message'].str.contains('[$£€]').astype(int).to_frame()
has_money_symbol_tf = FunctionTransformer(has_money_symbol, validate=False)

def has_number_large_number(df):
    # Raw string so that \d reaches the regex engine unchanged
    return df['message'].str.contains(r'\d{4,}').astype(int).to_frame()
has_number_large_number_tf = FunctionTransformer(has_number_large_number, validate=False)

def has_yelling(df):
    return df['message'].str.contains('[A-Z]{4,}').astype(int).to_frame()
has_yelling_tf = FunctionTransformer(has_yelling, validate=False)

def is_exclaiming(df):
    return df['message'].str.contains('!').astype(int).to_frame()
is_exclaiming_tf = FunctionTransformer(is_exclaiming, validate=False)

def has_domain_name(df):
    # A dot followed by two or more word characters (a rough proxy for a domain)
    return df['message'].str.contains(r'\.[\w]{2,}').astype(int).to_frame()
has_domain_name_tf = FunctionTransformer(has_domain_name, validate=False)

def has_special_characters(df):
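    # Note: inside the brackets, `:-=` is a range (':' through '='),
    # so ';' matches and a literal '-' does not
    return df['message'].str.contains(r'[\/<>&:-=+@]').astype(int).to_frame()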
has_special_characters_tf = FunctionTransformer(has_special_characters, validate=False)
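
Before wiring patterns like these into transformers, it can help to try them on a few strings by hand. The sketch below uses the standard-library `re` module, whose `re.search` looks for a match anywhere in the string, as `str.contains` does with its default `regex=True`. The sample messages are made up for illustration and are not from the dataset.

```python
import re

# Patterns copied from the transformers above, written as raw strings
patterns = {
    'has_money_symbol': r'[$£€]',
    'has_number_large_number': r'\d{4,}',
    'has_yelling': r'[A-Z]{4,}',
    'is_exclaiming': r'!',
    'has_domain_name': r'\.[\w]{2,}',
}

# Hypothetical sample messages, made up for illustration
samples = [
    'WINNER!! Claim your £1000 prize at www.example.com',
    'ok see you at lunch',
]

def flags(message):
    # 1 if the pattern occurs anywhere in the message, else 0
    return {name: int(bool(re.search(p, message))) for name, p in patterns.items()}

results = [flags(m) for m in samples]
for message, result in zip(samples, results):
    print(message, result)

# The special-characters class contains the range ':-=' (':' through '='),
# so it matches ';' but not a literal hyphen
special = r'[\/<>&:-=+@]'
hyphen_matches = bool(re.search(special, 'a-b'))
semicolon_matches = bool(re.search(special, 'a;b'))
print(hyphen_matches, semicolon_matches)  # False True
```

If a literal hyphen is wanted in a character class, placing it last (for example `[/<>&:=+@-]`) avoids the accidental range; doing so would change the feature and therefore the scores reported in this notebook.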

## Feature union
Combine all your function transformers into a feature union

In [70]:
fu = FeatureUnion([
    ('has_money_symbol_tf', has_money_symbol_tf),
    ('has_number_large_number_tf', has_number_large_number_tf),
    ('has_yelling_tf', has_yelling_tf),
    ('is_exclaiming_tf', is_exclaiming_tf),
    ('has_domain_name_tf', has_domain_name_tf),
    ('has_special_characters_tf', has_special_characters_tf)
])
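
A `FeatureUnion` runs each transformer on the same input and stacks the results side by side, one column per transformer here. The sketch below illustrates that on a three-row toy frame with two of the transformers; the `toy` data and the names `toy_fu`, `money`, and `exclaim` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy messages, made up for illustration
toy = pd.DataFrame({'message': ['WIN £500 now!', 'see you soon', 'call me!']})

def has_money_symbol(df):
    return df['message'].str.contains(r'[$£€]').astype(int).to_frame()

def is_exclaiming(df):
    return df['message'].str.contains('!').astype(int).to_frame()

toy_fu = FeatureUnion([
    ('money', FunctionTransformer(has_money_symbol, validate=False)),
    ('exclaim', FunctionTransformer(is_exclaiming, validate=False)),
])

# One row per message, one column per transformer
features = np.asarray(toy_fu.fit_transform(toy))
print(features.shape)  # (3, 2)
print(features)
```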

## Pipeline
Create a pipeline with two components:
1. The `FeatureUnion` you set up in the previous step
2. The `LogisticRegression` from `sklearn`

In [71]:
pipe = Pipeline([
    ('fu', fu),
    ('lr', LogisticRegression())
])

## Cross Validation
Using only the features you've created in the Feature Extraction step, see what accuracy score you can get with your **untuned** pipeline model from the previous step. You'll need the `cross_val_score` function for this step.

Some suggestions:
1. Look at a random sampling of spam messages and see what regex patterns you can glean from your observations.
2. If you're testing an idea, use `df.loc[]` and take the mean of the `label`. For example:

```python
# Fraction of messages containing a forward slash that are spam
df.loc[df['message'].str.contains('/'), 'label'].mean()
```

In [72]:
cross_val_score(pipe, df, df['label']).mean()

0.97595111853847172
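
The call above relies on the defaults of `cross_val_score`. The default number of folds changed from 3 to 5 in scikit-learn 0.22, so the exact score depends on the installed version. A minimal sketch that states `cv` and `scoring` explicitly, and passes only the `message` column as features, might look like the following; the `toy` data and `toy_pipe` are made up for illustration and are not the notebook's dataset or pipeline.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data, made up for illustration: 10 spam, 10 ham
toy = pd.DataFrame({
    'message': ['WIN cash now!'] * 10 + ['see you later'] * 10,
    'label': [1] * 10 + [0] * 10,
})

def is_exclaiming(df):
    return df['message'].str.contains('!').astype(int).to_frame()

toy_pipe = Pipeline([
    ('exclaim', FunctionTransformer(is_exclaiming, validate=False)),
    ('lr', LogisticRegression()),
])

# Features are the message column only; folds and metric are explicit
scores = cross_val_score(toy_pipe, toy[['message']], toy['label'],
                         cv=5, scoring='accuracy')
print(len(scores), scores.mean())
```

Comparing `scores.mean()` against the baseline accuracy computed earlier shows how much the hand-built features add over always predicting ham.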