## Recap:

1. Pipelines for Data preprocessing
2. Column Transformers for Data preprocessing
3. Pipelines for Data preprocessing and also Machine Learning
4. Case Studies for Pipelines on multiple datasets

## Agenda:

1. Define what heterogeneous data is
2. How to deal with heterogeneous data
3. Implement techniques to deal with heterogeneous data
4. Pipeline to deal with heterogeneous data

## Loading the standard libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Load the data

In [2]:
data = pd.read_csv('spam.csv')
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Note:

1. Spam = unwanted or harmful message
2. Ham = legitimate (safe) message

In [3]:
data.shape

(5572, 2)

## Rearrange the dataset with target variable as the last column in the dataset

In [4]:
data = data[['Message', 'Category']]
data.head()

Unnamed: 0,Message,Category
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


## Encoding the category column of the data

In [5]:
cat_labels = {'ham' : 0, 'spam' : 1}
data = data.replace({'Category' : cat_labels})
data.head()

Unnamed: 0,Message,Category
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


### Notes:

1. In order to apply any classification algorithm such as LogisticRegression or KNeighborsClassifier to the above data, the Message column has to be encoded into a numerical format.
2. Hence, the task is to encode the Message column using a suitable technique.
3. The Message column is free text, where almost every value is unique. Label Encoding or One Hot Encoding would treat each whole message as its own category, so neither is useful for this kind of column.
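
As a rough illustration of note 3, the sketch below one-hot encodes a few made-up messages (the strings are assumptions, not rows from `spam.csv`). Each distinct message becomes its own column, so messages that share words still share no features.

```python
import pandas as pd

# Made-up messages for illustration only
messages = pd.Series(['Ok lar', 'Free entry now', 'Ok see you', 'Free prize now'])

# One-hot encoding treats every whole message as a separate category
one_hot = pd.get_dummies(messages)
print(one_hot.shape)          # one column per distinct message
print(list(one_hot.columns))
```

A message that was not seen before would match none of these columns, which is why a word-level encoding is needed instead.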

## Q. How to encode the Message column into a numerical format? - CountVectorizer

In [6]:
text = ['It was the best of times', 'India is an incredible country', 'I am living in India']
text

['It was the best of times',
 'India is an incredible country',
 'I am living in India']

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec

In [8]:
vec.fit(text)

In [9]:
vec.vocabulary_

{'it': 8,
 'was': 13,
 'the': 11,
 'best': 2,
 'of': 10,
 'times': 12,
 'india': 6,
 'is': 7,
 'an': 1,
 'incredible': 5,
 'country': 3,
 'am': 0,
 'living': 9,
 'in': 4}
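
The dictionary above maps each token to a column index. A minimal sketch of two properties worth noticing, assuming the default `CountVectorizer` settings (the name `vec_demo` is introduced here only for illustration): tokens are lowercased and indexed in alphabetical order, and single-character tokens such as "I" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['It was the best of times', 'India is an incredible country', 'I am living in India']

vec_demo = CountVectorizer()
vec_demo.fit(corpus)

# Order the tokens by their column index; this matches alphabetical order
tokens_by_index = sorted(vec_demo.vocabulary_, key=vec_demo.vocabulary_.get)
print(tokens_by_index)

# The default token pattern keeps only tokens with two or more characters,
# so the single-letter word 'I' never enters the vocabulary
print('i' in vec_demo.vocabulary_)
```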

In [10]:
text

['It was the best of times',
 'India is an incredible country',
 'I am living in India']

In [11]:
text2 = ['It was the best of times', 'The Times of India', 'I am living in India']
text2

['It was the best of times', 'The Times of India', 'I am living in India']

In [12]:
vec.fit(text2)

In [13]:
vec.vocabulary_

{'it': 4,
 'was': 9,
 'the': 7,
 'best': 1,
 'of': 6,
 'times': 8,
 'india': 3,
 'am': 0,
 'living': 5,
 'in': 2}

In [14]:
## Transform `text` using the vocabulary that CountVectorizer learned from `text2`
vec.transform(text)

<3x10 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [15]:
vec.transform(text).toarray()

array([[0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]], dtype=int64)

In [16]:
res = pd.DataFrame(vec.transform(text).toarray(), columns = vec.get_feature_names_out())
res

Unnamed: 0,am,best,in,india,it,living,of,the,times,was
0,0,1,0,0,1,0,1,1,1,1
1,0,0,0,1,0,0,0,0,0,0
2,1,0,1,1,0,1,0,0,0,0
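
The second row above contains a single 1 because the vectorizer was last fitted on `text2`, while the sentence being transformed comes from `text`. A small sketch of that behaviour (variable names here are illustrative): words that are not in the fitted vocabulary are silently ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

fitted_on = ['It was the best of times', 'The Times of India', 'I am living in India']
unseen = ['India is an incredible country']

vec_demo = CountVectorizer().fit(fitted_on)
row = vec_demo.transform(unseen).toarray()[0]

# Keep only the tokens that were actually counted in the unseen sentence
counted = {tok: int(row[idx]) for tok, idx in vec_demo.vocabulary_.items() if row[idx] > 0}
print(counted)  # {'india': 1}
```

This is why the vectorizer should be fitted on the training text and then reused, unchanged, on any new text.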


In [17]:
text

['It was the best of times',
 'India is an incredible country',
 'I am living in India']

## Spam Ham classification

In [18]:
data.head()

Unnamed: 0,Message,Category
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0


## Separate X and y from the data

In [19]:
X = data['Message']
y = data['Category']

## Encode the message column using CountVectorizer

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec

In [21]:
X_vec = vec.fit_transform(X)
X_vec

<5572x8709 sparse matrix of type '<class 'numpy.int64'>'
	with 74098 stored elements in Compressed Sparse Row format>
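
The result is kept as a sparse matrix because almost all of its entries are zero. As a rough check, using only the figures printed in the summary above, the share of non-zero cells can be computed as follows; the same calculation is repeated on a small made-up corpus through the `nnz` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Figures copied from the matrix summary above
n_rows, n_cols, n_stored = 5572, 8709, 74098
density = n_stored / (n_rows * n_cols)
print(f'{density:.4%} of the cells are non-zero')

# Same calculation on a toy corpus (made-up sentences)
toy_corpus = ['free entry to win', 'ok see you later', 'win a free prize']
toy_matrix = CountVectorizer().fit_transform(toy_corpus)
toy_density = toy_matrix.nnz / (toy_matrix.shape[0] * toy_matrix.shape[1])
print(toy_matrix.shape, toy_density)
```

Calling `.toarray()` on the full matrix would materialise every one of those zeros, so it is best reserved for small inspections.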

## Split the data into train and test sets

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size = 0.3, random_state = 0)

## Apply Logistic Regression on X_train and y_train

In [23]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr

In [24]:
lr.fit(X_train, y_train)

## Perform prediction on X_test

In [25]:
y_pred = lr.predict(X_test)
y_pred

array([0, 1, 0, ..., 0, 0, 0], dtype=int64)

In [26]:
pd.DataFrame(X_test.toarray(), columns = vec.get_feature_names_out())

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1667,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1668,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1669,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1670,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Check the Logistic Regression model on sample email data

In [27]:
email_ham = [
    'Even my brother is not like to speak with me. They treat me like aids patient.'  ## Ham
]

In [28]:
lr.predict(vec.transform(email_ham))

array([0], dtype=int64)

In [29]:
email_spam = [
    'England vs Macedonia - dont miss the goals/team news. txt ur national team to 87077 eg ENGLAND to 87077Try:WALES, SCOTLAND 4txt/A~0&.20'
]
email_spam

['England vs Macedonia - dont miss the goals/team news. txt ur national team to 87077 eg ENGLAND to 87077Try:WALES, SCOTLAND 4txt/A~0&.20']

In [30]:
lr.predict(vec.transform(email_spam))

array([1], dtype=int64)

## Evaluation step:

In [31]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)  # signature is accuracy_score(y_true, y_pred)

0.9760765550239234
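
Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up label arrays (not the spam data) showing that the manual calculation agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 0 = ham, 1 = spam
y_true_demo = np.array([0, 0, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 0, 0])

# Fraction of positions where prediction and truth agree
manual = np.mean(y_true_demo == y_pred_demo)
library = accuracy_score(y_true_demo, y_pred_demo)
print(manual, library)
```

Here five of the six predictions are correct, so both values equal 5/6.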

# =========================================================================================

## Create a pipeline to perform text processing

In [32]:
steps = [('vectorization', CountVectorizer()), ('Classification', LogisticRegression())]
steps

[('vectorization', CountVectorizer()),
 ('Classification', LogisticRegression())]

In [33]:
## Importing the estimators used as steps in the Pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
pipe = Pipeline(steps)
pipe

## Fit the pipeline on X_train and y_train

In [35]:
pipe.fit(X_train, y_train)

AttributeError: lower not found
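
The `AttributeError: lower not found` most likely arises because `X_train` was split from `X_vec`, which is already a sparse count matrix. The `CountVectorizer` step inside the pipeline expects raw strings and calls `.lower()` on each document. A minimal sketch of one way around this, assuming the split is done on the raw text instead; the toy messages and the `*_raw` and `pipe_demo` names are made up for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for the spam data (made-up messages; 0 = ham, 1 = spam)
toy = pd.DataFrame({
    'Message': [
        'Ok see you at lunch', 'Win a free prize now', 'Are you coming home',
        'Free entry txt to claim', 'Call me when you land', 'Claim your free cash now',
        'See you tomorrow then', 'Txt WIN to claim prize',
    ],
    'Category': [0, 1, 0, 1, 0, 1, 0, 1],
})

# Split the raw text, not the vectorized matrix
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    toy['Message'], toy['Category'], test_size=0.25, random_state=0, stratify=toy['Category']
)

pipe_demo = Pipeline([
    ('vectorization', CountVectorizer()),
    ('classification', LogisticRegression()),
])

# The pipeline vectorizes the training text and then fits the classifier
pipe_demo.fit(X_train_raw, y_train_raw)
preds = pipe_demo.predict(X_test_raw)
print(preds)
```

A side benefit of this arrangement is that the vocabulary is learned from the training messages only, so nothing from the test set leaks into the features.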

## Create a Column Transformer for the text data using CountVectorizer

1. Create a num pipeline
2. Create a categorical pipeline
3. Create a text pipeline
4. Separate num_features, cat_features and text_features
5. Create full_pipeline by combining the num pipeline, cat pipeline and text pipeline, and apply it to num_features, cat_features and text_features
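
The five steps above can be sketched roughly as follows. This is an illustration on a small made-up frame, not the notebook's final implementation: the rows, the choice of scaler and encoder, and all names below are assumptions, and missing-value handling is left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only
toy = pd.DataFrame({
    'item_condition_id': [3, 3, 1, 1],
    'shipping': [1, 0, 1, 0],
    'brand_name': ['Razer', 'Target', 'Razer', 'Nike'],
    'item_description': [
        'keyboard in great condition',
        'adorable top with lace',
        'gaming mouse works great',
        'running shoes new with tags',
    ],
})

# Step 4: separate the feature groups
num_features = ['item_condition_id', 'shipping']
cat_features = ['brand_name']
text_feature = 'item_description'  # a plain string, so CountVectorizer receives a 1-D column

# Steps 1-3: one small pipeline per kind of column
num_pipeline = Pipeline([('scale', StandardScaler())])
cat_pipeline = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])
text_pipeline = Pipeline([('vectorize', CountVectorizer())])

# Step 5: combine them and apply each pipeline to its own columns
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
    ('text', text_pipeline, text_feature),
])

features = full_pipeline.fit_transform(toy)
print(features.shape)
```

Note that the text column is passed as a single string rather than a list. `CountVectorizer` works on a one-dimensional sequence of documents, whereas a list of column names would hand it a two-dimensional frame.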

In [37]:
dataset = pd.read_csv('train.tsv', sep = '\t')
dataset.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity


In [38]:
dataset.shape

(1482535, 8)

## Observations:

- Total rows = 1,482,535
- Total columns = 8