# Dataset Source

The Ling-Spam dataset was collected from Kaggle: https://www.kaggle.com/mandygu/lingspam-dataset/notebooks

# Imports & Installations

In [None]:
!pip install twython

Collecting twython
  Downloading https://files.pythonhosted.org/packages/24/80/579b96dfaa9b536efde883d4f0df7ea2598a6f3117a6dd572787f4a2bcfb/twython-3.8.2-py3-none-any.whl
Installing collected packages: twython
Successfully installed twython-3.8.2


In [None]:
#Import libs
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier,ExtraTreesClassifier
from collections import Counter
import string
import warnings
warnings.filterwarnings('ignore')
import re
from tqdm import tqdm
from nltk.tokenize import word_tokenize, sent_tokenize

# Collecting Dataset

The dataset is fetched from my GitHub repository because the original CSV is hosted on Kaggle, which requires an account to download it and therefore restricts open access. A public GitHub repository lets us fetch the dataset without any sign-up or permission.

In [None]:
!wget "http://raw.githubusercontent.com/alihussainia/Email-Spam-Classification-Project/master/messages.csv.zip"

URL transformed to HTTPS due to an HSTS policy
--2021-02-04 16:42:15--  https://raw.githubusercontent.com/alihussainia/Email-Spam-Classification-Project/master/messages.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3272271 (3.1M) [application/zip]
Saving to: ‘messages.csv.zip.1’


2021-02-04 16:42:16 (7.69 MB/s) - ‘messages.csv.zip.1’ saved [3272271/3272271]



In [None]:
# Extracting dataset
!unzip messages.csv.zip

Archive:  messages.csv.zip
replace messages.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
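
The `replace messages.csv?` prompt above appears because an earlier run had already extracted the file. As a minimal sketch (not part of the original notebook), Python's standard `zipfile` module can do the extraction without a prompt, since `extractall` overwrites existing files. The helper name `extract_archive` and the throwaway demo archive are assumptions for illustration; in the notebook you would pass `messages.csv.zip` instead.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir, overwriting existing files."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Demo on a throwaway archive so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.csv.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo.csv", "subject,message,label\nhi,hello there,0\n")

    members = extract_archive(demo_zip, tmp)        # first extraction
    members_again = extract_archive(demo_zip, tmp)  # second run overwrites silently
    extracted_text = (Path(tmp) / "demo.csv").read_text()
```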


# Understanding Our Dataset

In [None]:
# Loading dataset as a Pandas DataFrame
dataset = pd.read_csv('messages.csv')

In [None]:
dataset.head(10)

Unnamed: 0,subject,message,label
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0
5,call for abstracts : optimality in syntactic t...,content - length : 4437 call for papers is the...,0
6,m . a . in scandinavian linguistics,m . a . in scandinavian linguistics at the uni...,0
7,call for papers : linguistics session of the m...,call for papers linguistics session - - midwes...,0
8,foreign language in commercials,content - length : 1937 greetings ! i ' m wond...,0
9,fulbright announcement : please post / dissemi...,fulbright announcement : please post / dissemi...,0


# Data Transformation

### Converting Messages to Lower Case

In [None]:
# converting all messages to lower case

dataset['message'] = dataset['message'].str.lower()

### Cleaning Data

We start by checking for null values.

In [None]:
# Cleaning data
# checking null values
dataset.isnull().sum()

subject    62
message     0
label       0
dtype: int64

The output shows that values are missing only in the subject column, so we fill them with that column's most frequent value (the mode). Because `mode()` returns a Series, we take its first element before passing it to `fillna`.



In [None]:
dataset.fillna(dataset['subject'].mode().values[0],inplace=True)
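
The cell above calls `fillna` on the whole DataFrame, which works here because only `subject` has missing values. As a hedged sketch of the same idea restricted to one column, the block below uses a toy DataFrame whose rows are made up for illustration. `mode()` returns a Series because several values can tie, so `.iloc[0]` picks the first one.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "subject": ["sociolinguistics", np.nan, "risk", "sociolinguistics"],
    "message": ["a", "b", "c", "d"],
})

# mode() returns a Series (there can be several modes), so take its first entry.
most_common_subject = toy["subject"].mode().iloc[0]

# Fill only the 'subject' column instead of every column in the frame.
toy["subject"] = toy["subject"].fillna(most_common_subject)
```

Filling a single column this way avoids accidentally writing a subject string into other columns if they ever contain missing values.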

In [None]:
# let's check for null values once again
dataset.isnull().sum()

subject    0
message    0
label      0
dtype: int64

No null values remain, so we can move on to the next steps.

In [None]:
df = dataset.copy()

To get a fuller picture of each email, let's merge the subject and the message into a single column.

In [None]:
df['sub_mssg']=df['subject']+df['message']
df.head()

Unnamed: 0,subject,message,label,sub_mssg
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0,job posting - apple-iss research centercontent...
1,sociolinguistics,"lang classification grimes , joseph e . and ba...",0,"sociolinguisticslang classification grimes , j..."
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0,query : letter frequencies for text identifica...
3,risk,a colleague and i are researching the differin...,0,riska colleague and i are researching the diff...
4,request book information,earlier this morning i was on the phone with a...,0,request book informationearlier this morning i...
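
In the output above, the last word of the subject is fused with the first word of the message (for example `centercontent` and `riska`) because the two columns are joined without a separator. A hedged alternative, shown here on made-up toy rows, is `Series.str.cat` with `sep=" "`. Adopting it in the notebook would change the lengths and cleaned text shown in the later outputs.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "subject": ["risk", "request book information"],
    "message": ["a colleague and i", "earlier this morning"],
})

# Plain '+' glues the columns together with nothing in between.
fused = toy["subject"] + toy["message"]

# str.cat with sep=" " keeps a word boundary between subject and message.
spaced = toy["subject"].str.cat(toy["message"], sep=" ")
```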


In [None]:
df['sub_mssg'].describe()

count                                                  2893
unique                                                 2876
top       re := 20 the virtual girlfriend and virtual bo...
freq                                                      4
Name: sub_mssg, dtype: object

Adding a length column that holds the original character length of subject + message.

In [None]:
df['length']=df['sub_mssg'].apply(len)
df.head()

Unnamed: 0,subject,message,label,sub_mssg,length
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0,job posting - apple-iss research centercontent...,2895
1,sociolinguistics,"lang classification grimes , joseph e . and ba...",0,"sociolinguisticslang classification grimes , j...",1816
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0,query : letter frequencies for text identifica...,1485
3,risk,a colleague and i are researching the differin...,0,riska colleague and i are researching the diff...,328
4,request book information,earlier this morning i was on the phone with a...,0,request book informationearlier this morning i...,1070


Dropping unnecessary features

In [None]:
df.drop('subject',axis=1,inplace=True)

### Preprocessing Email Messages

In [None]:
def decontact(phrase):
    """Expand common English contractions (e.g. "won't" -> "will not")."""
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
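
The notebook defines `decontact`, but the cells shown here never call it. As a sketch of how such a helper behaves, the block below re-implements the same substitutions in a table-driven form; the name `expand_contractions` and the sample sentences are assumptions for illustration. Note that the raw messages appear to separate apostrophes with spaces (e.g. `i ' m` in the preview above), which these patterns do not match.

```python
import re

# (pattern, replacement) pairs, applied in order: specific cases first.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def expand_contractions(phrase):
    """Expand common English contractions in a string."""
    for pattern, replacement in CONTRACTIONS:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


expanded = expand_contractions("i won't go because they're late and i'm tired")

# Apostrophes surrounded by spaces are left untouched by these patterns.
spaced_apostrophe = expand_contractions("i ' m wondering")
```

One caveat of this rule set is that `'s` is always expanded to ` is`, so possessives are rewritten as well.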

In [None]:
# REPLACING NUMBERS (INTEGERS AND DECIMALS) BY THE TOKEN 'numbers'
df['sub_mssg']=df['sub_mssg'].str.replace(r'\d+(\.\d+)?', 'numbers', regex=True)
df['sub_mssg'][0]

"job posting - apple-iss research centercontent - length : numbers apple-iss research center a us $ numbers million joint venture between apple computer inc . and the institute of systems science of the national university of singapore , located in singapore , is looking for : a senior speech scientist - - - - - - - - - - - - - - - - - - - - - - - - - the successful candidate will have research expertise in computational linguistics , including natural language processing and * * english * * and * * chinese * * statistical language modeling . knowledge of state-of - the-art corpus-based n - gram language models , cache language models , and part-of - speech language models are required . a text - to - speech project leader - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - the successful candidate will have research expertise expertise in two or more of the following areas : computational linguistics , including natural language parsing , lexical database design , and statis

In [None]:
# CONVERTING EVERYTHING TO LOWERCASE
df['sub_mssg']=df['sub_mssg'].str.lower()
# REPLACING NEWLINES BY WHITE SPACE
df['sub_mssg']=df['sub_mssg'].str.replace(r'\n', ' ', regex=True)
# REPLACING EMAIL IDs BY 'MailID'
# (anchored with ^ and $: the pattern matches the entire text or nothing at all)
df['sub_mssg']=df['sub_mssg'].str.replace(r'^.+@[^\.].*\.[a-z]{2,}$', 'MailID', regex=True)
# REPLACING URLs BY 'Links' (also anchored with ^ and $)
df['sub_mssg']=df['sub_mssg'].str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'Links', regex=True)
# REPLACING CURRENCY SIGNS BY 'Money'
df['sub_mssg']=df['sub_mssg'].str.replace(r'£|\$', 'Money', regex=True)
# REPLACING LARGE WHITE SPACE BY SINGLE WHITE SPACE
df['sub_mssg']=df['sub_mssg'].str.replace(r'\s+', ' ', regex=True)

# REMOVING LEADING AND TRAILING WHITE SPACE
df['sub_mssg']=df['sub_mssg'].str.replace(r'^\s+|\s+?$', '', regex=True)
# REPLACING CONTACT NUMBERS (also anchored with ^ and $)
df['sub_mssg']=df['sub_mssg'].str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'contact number', regex=True)
# REPLACING SPECIAL CHARACTERS BY WHITE SPACE
df['sub_mssg']=df['sub_mssg'].str.replace(r"[^a-zA-Z0-9]+", " ", regex=True)
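
The email, URL, and phone patterns above are anchored with `^` and `$`, so each one matches the entire text or nothing. An address embedded in a longer message is not replaced, and a text that does match is replaced as a whole. A hedged sketch of unanchored variants that work inside longer text follows. The simplified patterns, the lowercase placeholder words, the `normalize_tokens` name, and the sample strings are assumptions for illustration, not the notebook's method, and real addresses and URLs are more varied than these patterns cover.

```python
import re

# Simplified, unanchored patterns (assumptions for illustration).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"https?://\S+")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}\b")


def normalize_tokens(text):
    """Replace URLs, email addresses, and phone numbers with placeholder words."""
    text = URL_RE.sub("links", text)
    text = EMAIL_RE.sub("mailid", text)
    text = PHONE_RE.sub("contactnumber", text)
    return text


sample = "write to john.doe@example.com or visit http://example.com/a call 555-123-4567 now"
normalized = normalize_tokens(sample)

# For comparison: the anchored email pattern swallows a whole matching text.
anchored = re.sub(r"^.+@[^\.].*\.[a-z]{2,}$", "MailID", "mail me at a@b.com")
```

If these replacements run after the digit replacement step, the phone pattern can no longer match, so the order of the cleaning steps matters.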

In [None]:
# SAME CLEANING STEPS, APPLIED TO THE RAW 'message' COLUMN
# CONVERTING EVERYTHING TO LOWERCASE
df['message']=df['message'].str.lower()
# REPLACING NEWLINES BY WHITE SPACE
df['message']=df['message'].str.replace(r'\n', ' ', regex=True)
# REPLACING EMAIL IDs BY 'MailID'
# (anchored with ^ and $: the pattern matches the entire text or nothing at all)
df['message']=df['message'].str.replace(r'^.+@[^\.].*\.[a-z]{2,}$', 'MailID', regex=True)
# REPLACING URLs BY 'Links' (also anchored with ^ and $)
df['message']=df['message'].str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'Links', regex=True)
# REPLACING CURRENCY SIGNS BY 'Money'
df['message']=df['message'].str.replace(r'£|\$', 'Money', regex=True)
# REPLACING LARGE WHITE SPACE BY SINGLE WHITE SPACE
df['message']=df['message'].str.replace(r'\s+', ' ', regex=True)

# REMOVING LEADING AND TRAILING WHITE SPACE
df['message']=df['message'].str.replace(r'^\s+|\s+?$', '', regex=True)
# REPLACING CONTACT NUMBERS (also anchored with ^ and $)
df['message']=df['message'].str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'contact number', regex=True)
# REPLACING SPECIAL CHARACTERS BY WHITE SPACE
df['message']=df['message'].str.replace(r"[^a-zA-Z0-9]+", " ", regex=True)
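
The cleaning cells above rely on `Series.str.replace` treating each pattern as a regular expression. In pandas 2.0 and later the default is `regex=False`, so passing the flag explicitly keeps the behaviour the same across versions. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up sample text for illustration.
s = pd.Series(["length : 3386 bytes , us $ 10.5 million"])
pattern = r"\d+(\.\d+)?"

# Literal matching: the pattern text itself does not occur, so nothing changes.
literal = s.str.replace(pattern, "numbers", regex=False)

# Regex matching: integers and decimals are each replaced by one token.
as_regex = s.str.replace(pattern, "numbers", regex=True)
```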

**Removing stopwords**


In [None]:
# # to check the nltk stopwords list
# import nltk
# from nltk.corpus import stopwords
# print(stopwords.words('english'))

# removing stopwords
import nltk
nltk.download('stopwords')
stop = set(stopwords.words('english'))  # a set makes the membership test below fast
df['Cleaned_Text'] = df['sub_mssg'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
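
As a sketch of what the stopword filter does, the block below applies the same split, filter, and join logic with a tiny hand-picked stopword set, so it runs without downloading NLTK data. The set `TOY_STOPWORDS`, the function name, and the sample sentence are assumptions for illustration; NLTK's English list is much longer.

```python
# A tiny stand-in for NLTK's English stopword list (assumption for illustration).
TOY_STOPWORDS = {"a", "and", "i", "are", "the", "in", "of"}


def remove_stopwords(text, stopword_set=TOY_STOPWORDS):
    """Drop whitespace-separated tokens that appear in stopword_set."""
    return " ".join(word for word in text.split() if word not in stopword_set)


cleaned = remove_stopwords(
    "riska colleague and i are researching the differing degrees of risk"
)
```

A `set` is used for the stopwords because membership tests on a set are faster than on a list, which matters when the filter runs over every token of every message.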


Dropping redundant features

In [None]:
df.drop('message',axis=1,inplace=True)

In [None]:
df.drop('sub_mssg',axis=1,inplace=True)

In [None]:
# checking null values in df
df.isnull().sum()

label           0
length          0
Cleaned_Text    0
dtype: int64

In [None]:
df['lgth_clean']=df['Cleaned_Text'].apply(len)
df.head()

Unnamed: 0,label,length,Cleaned_Text,lgth_clean
0,0,2895,job posting apple iss research centercontent l...,2108
1,0,1816,sociolinguisticslang classification grimes jos...,1506
2,0,1485,query letter frequencies text identificationi ...,1150
3,0,328,riska colleague researching differing degrees ...,216
4,0,1070,request book informationearlier morning phone ...,653


Dropping the length columns before viewing the dataset

In [None]:
df.drop('length',axis=1,inplace=True) # original subject+message len
df.drop('lgth_clean',axis=1,inplace=True) # after transformation sub+msg len

# Shape of the Dataset

In [None]:
df.shape

(2893, 2)

# First Ten Values of the Dataset

In [None]:
df.head(10)

Unnamed: 0,label,Cleaned_Text
0,0,job posting apple iss research centercontent l...
1,0,sociolinguisticslang classification grimes jos...
2,0,query letter frequencies text identificationi ...
3,0,riska colleague researching differing degrees ...
4,0,request book informationearlier morning phone ...
5,0,call abstracts optimality syntactic theorycont...
6,0,scandinavian linguisticsm scandinavian linguis...
7,0,call papers linguistics session mlacall papers...
8,0,foreign language commercialscontent length num...
9,0,fulbright announcement please post disseminate...


# Last Ten Values of the Dataset

In [None]:
df.tail(10)

Unnamed: 0,label,Cleaned_Text
2883,0,evolvable hardware gppaper available post scri...
2884,1,work calsvxtnhello thanks stopping taken many ...
2885,0,british vs american sgriffin bacal internet ma...
2886,1,fanny recommending nekdear sir madam spam mess...
2887,1,win Money numbersusd cruise raquel casino inc ...
2888,1,love profile ysuolvpvhello thanks stopping tak...
2889,1,asked join kiddinthe list owner kiddin invited...
2890,0,anglicization composers namesjudging return po...
2891,0,numbers numbers comparative method n ary compa...
2892,0,american english australiahello working thesis...
