# What is heterogeneous data?
- Heterogeneous data comes from different sources and is available in various formats.
- It can include structured, semi-structured and unstructured data.
- Examples of heterogeneous data sources include: databases, spreadsheets, log files, sensor data, text, social media posts, emails.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Problem Statement:
- Given an e-mail message, predict whether it is ham (a legitimate message) or spam.
- The data has 5572 email messages.

In [11]:
data = pd.read_csv("spam.csv")
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
data["Message"].iloc[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [6]:
data["Message"].iloc[2]

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

In [7]:
data.shape

(5572, 2)

# Observations:

- The data contains 5572 rows and 2 columns.
- Each row contains one email message, labelled as either spam or ham.
- The Message column holds free-form (unstructured) text, so the usual preprocessing steps for numerical or categorical columns cannot be applied to it directly.

In [8]:
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [12]:
# Convert Category into numerical labels: ham -> 0, spam -> 1

dic = {"ham": 0, "spam": 1}
data['Category'] = data['Category'].map(dic)  # map() gives NaN for any label missing from dic
data.head()

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


# Rearrange the columns: Category is the target and Message is the independent variable

In [13]:
data = data[["Message", "Category"]]
data.head()

Unnamed: 0,Message,Category
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


# Observations:
- To classify a message as spam or ham, we first have to convert Message into numerical data using some preprocessing steps.


In [14]:
# Look at the first message before applying one-hot encoding to the Message column
data["Message"].iloc[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [16]:
# Apply one-hot encoding (OHE) to the first five messages

pd.get_dummies(data["Message"].iloc[0:5])

Unnamed: 0,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...","Nah I don't think he goes to usf, he lives around here though",Ok lar... Joking wif u oni...,U dun say so early hor... U c already then say...
0,False,True,False,False,False
1,False,False,False,True,False
2,True,False,False,False,False
3,False,False,False,False,True
4,False,False,True,False,False


# Observations:
- When one-hot encoding is applied to the first five rows of the data, each entire message becomes a column name.
- With 5572 messages, this would create up to 5572 columns, one per distinct message.
- Such columns carry no information about the words inside a message, so an ML algorithm trained on them cannot generalise to messages it has not seen, and its outputs would not be meaningful.
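
The point above can be checked on a few made-up messages. This is a minimal sketch: the strings below are hypothetical and do not come from spam.csv, and the names `toy_messages` and `toy_one_hot` are only for illustration.

```python
import pandas as pd

# Hypothetical toy messages (not taken from spam.csv).
toy_messages = pd.Series([
    "Free entry to win a prize",
    "Ok see you at lunch",
    "Free entry to win a prize",   # exact duplicate of the first message
    "Call me when you are free",
])

toy_one_hot = pd.get_dummies(toy_messages)

# One column per distinct message, not per word.
n_columns = toy_one_hot.shape[1]
n_distinct = toy_messages.nunique()
print(n_columns, n_distinct)

# A message that differs by a single word has no column of its own,
# so the encoding carries no notion of similarity between messages.
unseen = "Free entry to win a car"
print(unseen in toy_one_hot.columns)
```

Only exact duplicates share a column, which is why whole-message one-hot encoding does not scale to free text.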

In [18]:
text = ["It was the best of times", "India is an, incredible country", "I #enjoying staying in India!"]
text

['It was the best of times',
 'India is an, incredible country',
 'I #enjoying staying in India!']

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec

In [20]:
vec.fit(text)

In [21]:
vec.vocabulary_

{'it': 8,
 'was': 13,
 'the': 11,
 'best': 1,
 'of': 9,
 'times': 12,
 'india': 6,
 'is': 7,
 'an': 0,
 'incredible': 5,
 'country': 2,
 'enjoying': 3,
 'staying': 10,
 'in': 4}

In [22]:
text

['It was the best of times',
 'India is an, incredible country',
 'I #enjoying staying in India!']

# Observations:
- The idea of CountVectorizer is to convert text data like the above into numerical data. This takes two steps:
  1. fit() - Builds the vocabulary: it assigns a column index to every distinct word available in the entire text, in alphabetical order.
  2. transform() - This is where the actual conversion to numerical data happens.
- There are a few more things to observe in the fit() output:
  1. Every word is in lower case.
  2. Special characters from the original text (comma, #, !) do not appear.
  3. Single-letter words such as "I" are ignored.
  4. These changes are made internally by CountVectorizer with its default settings.

In [23]:
vec.transform(text)

<3x14 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [24]:
vec.transform(text).toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
       [1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]], dtype=int64)

In [25]:
vec.get_feature_names_out()

array(['an', 'best', 'country', 'enjoying', 'in', 'incredible', 'india',
       'is', 'it', 'of', 'staying', 'the', 'times', 'was'], dtype=object)

In [26]:
res = pd.DataFrame(vec.transform(text).toarray(), columns = vec.get_feature_names_out())
res

Unnamed: 0,an,best,country,enjoying,in,incredible,india,is,it,of,staying,the,times,was
0,0,1,0,0,0,0,0,0,1,1,0,1,1,1
1,1,0,1,0,0,1,1,1,0,0,0,0,0,0
2,0,0,0,1,1,0,1,0,0,0,1,0,0,0


In [27]:
text

['It was the best of times',
 'India is an, incredible country',
 'I #enjoying staying in India!']

# Observations:
- Each sentence from the text has now been converted to numerical data.
- The number of sentences in the text determines the number of rows in the CountVectorizer output; the number of distinct words determines the number of columns.
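
As a small illustration of how rows and columns relate to the input, the sketch below transforms a new sentence with an already fitted vectorizer. The sentence in `new_doc` is made up for this example, and `shape_vec` is a hypothetical name chosen so that it does not clash with `vec` above.

```python
from sklearn.feature_extraction.text import CountVectorizer

fit_text = ["It was the best of times",
            "India is an, incredible country",
            "I #enjoying staying in India!"]

shape_vec = CountVectorizer().fit(fit_text)

# Rows = number of documents passed in, columns = size of the learned vocabulary.
counts = shape_vec.transform(fit_text)
print(counts.shape)

# transform() reuses the vocabulary learned in fit();
# words that were never seen during fit() ("truly") are silently dropped.
new_doc = ["India is the best best country, truly"]
row = shape_vec.transform(new_doc).toarray()[0]
present = {word: int(row[idx])
           for word, idx in shape_vec.vocabulary_.items()
           if row[idx] > 0}
print(present)
```

A repeated word such as "best" gets a count of 2, which is what distinguishes a count matrix from a purely binary encoding.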

In [28]:
text2 = ["I am interested in learning Data Science", "I love my country", "The Times of india is the best newspaper for english"]
text2

['I am interested in learning Data Science',
 'I love my country',
 'The Times of india is the best newspaper for english']

In [30]:
new_data = vec.fit_transform(text2)
new_data

<3x18 sparse matrix of type '<class 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

In [31]:
new_data.toarray()

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 2, 1]],
      dtype=int64)

In [33]:
vec.get_feature_names_out()

array(['am', 'best', 'country', 'data', 'english', 'for', 'in', 'india',
       'interested', 'is', 'learning', 'love', 'my', 'newspaper', 'of',
       'science', 'the', 'times'], dtype=object)

In [34]:
res2 = pd.DataFrame(new_data.toarray(), columns = vec.get_feature_names_out())
res2

Unnamed: 0,am,best,country,data,english,for,in,india,interested,is,learning,love,my,newspaper,of,science,the,times
0,1,0,0,1,0,0,1,0,1,0,1,0,0,0,0,1,0,0
1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
2,0,1,0,0,1,1,0,1,0,1,0,0,0,1,1,0,2,1


# Spam Ham Classification

In [35]:
data.head()

Unnamed: 0,Message,Category
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


# Separate X and y

In [36]:
X = data['Message']
y = data['Category']

In [37]:
X.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object

# Encode X using CountVectorizer()

In [38]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv

In [39]:
X_vec = cv.fit_transform(X)
X_vec

<5572x8709 sparse matrix of type '<class 'numpy.int64'>'
	with 74098 stored elements in Compressed Sparse Row format>

In [40]:
X_vec.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [41]:
cv.get_feature_names_out()

array(['00', '000', '000pes', ..., 'èn', 'ú1', '〨ud'], dtype=object)

In [42]:
pd.DataFrame(X_vec.toarray(), columns = cv.get_feature_names_out())

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Split the data into train and test sets

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size = 0.3, random_state = 0)
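
One variation worth considering: above, the vocabulary is learned from all messages before the split, so words that occur only in the test rows still get a column. A minimal sketch of the alternative, splitting the raw text first and fitting the vectorizer on the training part only, is shown below. The toy messages, labels and the `stratify` argument are assumptions made for this example and are not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for data['Message'] and data['Category'].
toy_messages = [
    "win a free prize now", "are you coming home",
    "free entry claim your cash", "see you at lunch",
    "urgent call to claim prize", "ok talk later",
    "text now to win cash", "i will call you tonight",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first; stratify keeps the spam/ham ratio in both parts.
msg_train, msg_test, lab_train, lab_test = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(msg_train)  # vocabulary learned from training rows only
test_counts = toy_cv.transform(msg_test)        # test rows reuse that vocabulary

print(train_counts.shape, test_counts.shape)
```

Both matrices have the same number of columns, so a model fitted on `train_counts` can score `test_counts` directly.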

# Apply Logistic Regression on the train set

In [44]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr

In [45]:
lr.fit(X_train, y_train)

# Performing predictions on X_test

In [46]:
y_pred = lr.predict(X_test)
y_pred

array([0, 1, 0, ..., 0, 0, 0], dtype=int64)

# Let's predict on a sample message

In [47]:
message1 = [
    "Even my brother does not like to speak with me. They treat me as a aids patients."
]
message1

['Even my brother does not like to speak with me. They treat me as a aids patients.']

In [48]:
lr.predict(cv.transform(message1))

array([0], dtype=int64)

In [50]:
message2 = [
    "England vs Macedonia - dont miss the goals/team news. text your national team to 87077 eg : ENGLAND TO 87077Try:Wales, SCOTLAND 4txt/A~0&.20"
]
message2

['England vs Macedonia - dont miss the goals/team news. text your national team to 87077 eg : ENGLAND TO 87077Try:Wales, SCOTLAND 4txt/A~0&.20']

In [51]:
lr.predict(cv.transform(message2))

array([1], dtype=int64)

# Check Accuracy

In [53]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9760765550239234

In [54]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[1445,    6],
       [  34,  187]], dtype=int64)

# Observations:

- The number of correct classifications is high: 1445 and 187.
- The number of incorrect classifications is low: 34 and 6.
- Rows are the actual classes and columns are the predicted classes, with ham = 0 (negative) and spam = 1 (positive).
- 1445 - True Negative. Message is ham and prediction is also ham.
- 187 - True Positive. Message is spam and prediction is also spam.
- 34 - False Negative. Message is spam but prediction is ham.
- 6 - False Positive. Message is ham but prediction is spam.
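
The four cells can also be turned into summary metrics. The sketch below re-enters the matrix printed above by hand and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes, with ham = 0 and spam = 1. Precision and recall are not computed in the notebook and are added here only as an illustration.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[1445, 6],
               [34, 187]])

# For a 2x2 matrix, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught

print(round(float(accuracy), 4), round(float(precision), 4), round(float(recall), 4))
# 0.9761 0.9689 0.8462
```

The accuracy matches the value returned by accuracy_score, while the lower recall shows that most of the errors are spam messages slipping through as ham.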

# Create a pipeline for performing text processing


In [55]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec

In [56]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr

In [57]:
steps = [("Text processing", vec), ("ML Modelling", lr)]
steps

[('Text processing', CountVectorizer()),
 ('ML Modelling', LogisticRegression())]

In [59]:
from sklearn.pipeline import Pipeline 
pipe = Pipeline(steps)
pipe

In [64]:
pipe.fit(X, y)
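
Once a pipeline like this is fitted, it accepts raw strings at prediction time, because the vectorizer step runs before the model. The sketch below shows this on a made-up corpus; the data, the step names and the name `toy_pipe` are assumptions for this example, not the notebook's own objects.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham.
toy_X = [
    "win a free prize now", "free entry claim your cash",
    "urgent call to claim prize", "text now to win cash",
    "are you coming home", "see you at lunch",
    "ok talk later", "i will call you tonight",
]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),     # text -> count matrix
    ("classifier", LogisticRegression()),  # count matrix -> label
])
toy_pipe.fit(toy_X, toy_y)

# No manual transform() call is needed: the pipeline vectorizes the string itself.
pred = toy_pipe.predict(["win a free prize now"])
print(pred)
```

Keeping both steps in one object also means the same vocabulary is reused for every later call to predict().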

In [66]:
dataset = pd.read_csv("train.tsv", sep = '\t')
dataset.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity


In [67]:
dataset.shape

(1482535, 8)

In [68]:
dataset["item_description"]

0                                         No description yet
1          This keyboard is in great condition and works ...
2          Adorable top with a hint of lace and a key hol...
3          New with tags. Leather horses. Retail for [rm]...
4                  Complete with certificate of authenticity
                                 ...                        
1482530    Lace, says size small but fits medium perfectl...
1482531     Little mermaid handmade dress never worn size 2t
1482532            Used once or twice, still in great shape.
1482533    There is 2 of each one that you see! So 2 red ...
1482534    New with tag, red with sparkle. Firm price, no...
Name: item_description, Length: 1482535, dtype: object