# Prepare the AI Spam Classifier Dataset
We'll combine two open source datasets curated by [The University of California, Irvine (UCI)](https://archive.ics.uci.edu):

- Spam SMS ([source](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection))
- YouTube Spam ([source](https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection))


### Requirements
- Python
- Jupyter (set up with [this video](https://www.youtube.com/watch?v=9tPS-7TWjq0))
- Pandas

### Step 1. Download Datasets

#### Create destination folders

In [1]:
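import pathlib

# When True, use the directory two levels above the current working directory as the project root
USE_PROJECT_ROOT = True
BASE_DIR = pathlib.Path(".").resolve()
if USE_PROJECT_ROOT:
    BASE_DIR = BASE_DIR.parent.parent
DATASET_DIR = BASE_DIR / "datasets"
ZIPS_DIR = DATASET_DIR / "zips"  # downloaded archives
EXPORT_DIR = DATASET_DIR / "exports"  # final combined dataset
SMS_SPAM_DIR = DATASET_DIR / "imports" / "sms-spam"  # extracted SMS files
YOUTUBE_SPAM_DIR = DATASET_DIR / "imports" / "youtube-spam"  # extracted YouTube files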

In [2]:
ZIPS_DIR.mkdir(exist_ok=True, parents=True)

EXPORT_DIR.mkdir(exist_ok=True, parents=True)

SMS_SPAM_DIR.mkdir(exist_ok=True, parents=True)

YOUTUBE_SPAM_DIR.mkdir(exist_ok=True, parents=True)

You could also create the directories with shell commands from a notebook cell (IPython expands `$NAME` to the value of the Python variable):

```
!mkdir -p $ZIPS_DIR
!mkdir -p $SMS_SPAM_DIR
!mkdir -p $YOUTUBE_SPAM_DIR
!mkdir -p $EXPORT_DIR
```

#### UCI Spam SMS
Source: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [3]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip -o $ZIPS_DIR/uci-sms-spam.zip
!unzip -o $ZIPS_DIR/uci-sms-spam.zip -d $SMS_SPAM_DIR

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  198k  100  198k    0     0   422k      0 --:--:-- --:--:-- --:--:--  422k
Archive:  /Users/cfe/Dev/ai-microservice/datasets/zips/uci-sms-spam.zip
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/sms-spam/SMSSpamCollection  
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/sms-spam/readme  


#### YouTube Spam
Source: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection

In [4]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip -o $ZIPS_DIR/uci-youtube-spam.zip
!unzip -o $ZIPS_DIR/uci-youtube-spam.zip -d $YOUTUBE_SPAM_DIR

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  159k  100  159k    0     0   359k      0 --:--:-- --:--:-- --:--:--  359k
Archive:  /Users/cfe/Dev/ai-microservice/datasets/zips/uci-youtube-spam.zip
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/youtube-spam/Youtube01-Psy.csv  
   creating: /Users/cfe/Dev/ai-microservice/datasets/imports/youtube-spam/__MACOSX/
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/youtube-spam/__MACOSX/._Youtube01-Psy.csv  
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/youtube-spam/Youtube02-KatyPerry.csv  
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/youtube-spam/__MACOSX/._Youtube02-KatyPerry.csv  
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/youtube-spam/Youtube03-LMFAO.csv  
  inflating: /Users/cfe/Dev/ai-microservice/datasets/imports/youtube-spam/__MACOSX/._Youtube03-LM
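
The `curl` and `unzip` commands above depend on a Unix-like shell. As a rough alternative, the same download-and-extract step can be sketched with the Python standard library alone. The helper name `download_and_extract` is hypothetical (it is not part of the original notebook), and the commented usage assumes the `ZIPS_DIR` and `SMS_SPAM_DIR` variables defined earlier.

```python
import pathlib
import urllib.request
import zipfile


def download_and_extract(url, zip_path, dest_dir):
    """Download `url` to `zip_path`, unpack it into `dest_dir`, and return the member names."""
    zip_path = pathlib.Path(zip_path)
    dest_dir = pathlib.Path(dest_dir)
    # Make sure both target folders exist before writing to them
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Save the archive to disk, overwriting any earlier copy
    urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Example usage with the notebook's variables (requires network access):
# download_and_extract(
#     "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
#     ZIPS_DIR / "uci-sms-spam.zip",
#     SMS_SPAM_DIR,
# )
```

Like `unzip -o`, `extractall` overwrites files that already exist, so re-running the cell should be safe.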

### Step 2. Load Datasets into a Pandas DataFrame

In [5]:
import pandas as pd

**Load the `sms-spam` dataset into a pandas dataframe**

In [6]:
sms_path = SMS_SPAM_DIR / 'SMSSpamCollection'
sms_df = pd.read_csv(str(sms_path), sep='\t', header=None)

Now name the columns and record which dataset each row came from:

In [7]:
sms_df.columns = ['label', 'text']
sms_df['source'] = 'uci-spam-sms'
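
One thing worth checking after loading a tab-separated text file like this: by default `pd.read_csv` treats `"` as a quoting character, so a message containing a stray double quote may swallow the lines that follow it. Whether this affects `SMSSpamCollection` is an assumption here, not something verified in this notebook. The sketch below uses a small made-up sample (not real rows from the dataset) to show the effect and how `quoting=csv.QUOTE_NONE` reads every line literally.

```python
import csv
import io

import pandas as pd

# Made-up sample: the first message opens a double quote that only closes on the next line
raw = 'ham\t"Hello there\nspam\tWin a prize" now\nham\tSee you soon\n'

# Default quoting: the quoted field spans a line break, so two lines merge into one row
default_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# QUOTE_NONE: quote characters are kept as ordinary text, one row per line
literal_df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, quoting=csv.QUOTE_NONE)

print(len(default_df), len(literal_df))
```

If the row count of `sms_df` looks lower than expected, passing `quoting=csv.QUOTE_NONE` to the `read_csv` call above is one option to try.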

**Load the `youtube-spam` datasets into a pandas dataframe**

The `youtube-spam` dataset is split across multiple CSV files. Let's combine them into a single DataFrame.

In [8]:
location = YOUTUBE_SPAM_DIR
csvs = list(location.glob("*.csv"))

In [9]:
new_dfs = []
for csv_path in csvs:
    # Read only the two columns we need from the current file
    csv_df = pd.read_csv(str(csv_path), usecols=['CLASS', 'CONTENT'])
    csv_df.rename(columns={'CLASS': 'class', 'CONTENT': 'text'}, inplace=True)
    # Convert the numeric class (1 = spam) to the same text labels the SMS dataset uses
    csv_df['label'] = csv_df['class'].apply(lambda x: "spam" if str(x) == "1" else "ham")
    sub_df = csv_df[['label', 'text']].copy()
    new_dfs.append(sub_df)

yt_df = pd.concat(new_dfs)
yt_df['source'] = 'uci-youtube-spam'
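
The label conversion in the loop can also be written without a lambda. The following is a minimal sketch on a made-up three-row frame (the comments are invented, not taken from the dataset), assuming `CLASS` only ever holds `1` or `0`.

```python
import pandas as pd

# Made-up rows shaped like the YouTube CSVs (CLASS and CONTENT columns only)
sample = pd.DataFrame(
    {
        "CLASS": [1, 0, 1],
        "CONTENT": ["Check out my channel", "Great song", "Subscribe here"],
    }
)

# Map the numeric class to the text labels used by the SMS dataset
labels = sample["CLASS"].map({1: "spam", 0: "ham"})
tidy = pd.DataFrame({"label": labels, "text": sample["CONTENT"]})
print(tidy)
```

One difference to keep in mind: `map` leaves any value other than `1` or `0` as `NaN`, whereas the lambda in the loop labels everything that is not `1` as `ham`.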

**Combine the `sms-spam` dataset and the `youtube-spam` dataset**

In [10]:
df = pd.concat([sms_df, yt_df])
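
Note that `pd.concat` keeps each input's original row index, so the combined frame can contain repeated index values. The sketch below, on two tiny made-up frames standing in for `sms_df` and `yt_df`, shows the effect, how `ignore_index=True` renumbers the rows, and a quick per-source label count as a sanity check.

```python
import pandas as pd

# Tiny made-up stand-ins for sms_df and yt_df
sms_sample = pd.DataFrame(
    {"label": ["ham", "spam"], "text": ["See you soon", "You won a prize"], "source": "uci-spam-sms"}
)
yt_sample = pd.DataFrame(
    {"label": ["spam", "ham"], "text": ["Subscribe here", "Great song"], "source": "uci-youtube-spam"}
)

# Default behavior: each frame keeps its own 0..n-1 index, so values repeat
stacked = pd.concat([sms_sample, yt_sample])

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
renumbered = pd.concat([sms_sample, yt_sample], ignore_index=True)

# Count rows per (source, label) pair to confirm both datasets are present
counts = renumbered.groupby(["source", "label"]).size()
print(counts)
```

Because the export step writes the file with `index=False`, the repeated index values do not reach the CSV. They only matter if you select rows by index label before exporting.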

### Step 3. Export Complete Dataset

In [11]:
df.to_csv(EXPORT_DIR / 'spam-dataset.csv', index=False)
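
After exporting, it can be worth reading the file back to confirm that nothing was lost, since message text may contain commas and quotes. Here is a minimal round-trip sketch using a temporary directory and made-up rows. In the notebook itself you would read `EXPORT_DIR / 'spam-dataset.csv'` and compare it against `df`.

```python
import pathlib
import tempfile

import pandas as pd

# Made-up rows, including text with commas and quotes that must survive the CSV round trip
sample = pd.DataFrame(
    {
        "label": ["ham", "spam"],
        "text": ["See you soon", 'Win "cash", now'],
        "source": ["uci-spam-sms", "uci-youtube-spam"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = pathlib.Path(tmp_dir) / "spam-dataset.csv"
    sample.to_csv(export_path, index=False)
    reloaded = pd.read_csv(export_path)

# Compare shape and text content of the reloaded frame with the original
same_shape = reloaded.shape == sample.shape
same_text = reloaded["text"].tolist() == sample["text"].tolist()
print(same_shape, same_text)
```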