###### ARTIFICIAL INTELLIGENCE ALGORITHMS
### Practical Lab 
# **#2**

In [115]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from scipy.stats import zscore

#### PART A

In [116]:
msg_data = pd.read_csv("../../practical_labs/Lab2_dataset.csv")
print(msg_data.head())

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(
    msg_data["text"]
).toarray()  # Vectorize string values for use in classifier
y = msg_data["label_num"]

   Unnamed: 0 label                                               text   
0         605   ham  Subject: enron methanol ; meter # : 988291\nth...  \
1        2349   ham  Subject: hpl nom for january 9 , 2001\n( see a...   
2        3624   ham  Subject: neon retreat\nho ho ho , we ' re arou...   
3        4685  spam  Subject: photoshop , windows , office . cheap ...   
4        2030   ham  Subject: re : indian springs\nthis deal is to ...   

   label_num  
0          0  
1          0  
2          0  
3          1  
4          0  


In [117]:
# Splitting the data to test and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)


gauss_model = GaussianNB()
gauss_model.fit(X_train, y_train)  # Train the model with training features and targets

gauss_preds = gauss_model.predict(
    X_test
)  # With model trained on the training data, make prediction for test features.


multi_model = MultinomialNB()
multi_model.fit(X_train, y_train)

multi_preds = multi_model.predict(
    X_test
)  # With model trained on the training data, make prediction for test features.


gauss_score = accuracy_score(
    y_test,
    gauss_preds,
)  # Calculate a score metric for model comparison


multi_score = accuracy_score(
    y_test, multi_preds
)  # Calculate a score metric for model comparison

print(f"Gaussian Accuracy {gauss_score}")
print(f"Multinomial Accuracy {multi_score}")

Gaussian Accuracy 0.9584541062801932
Multinomial Accuracy 0.881159420289855


In this case, the Gaussian classifier outperforms the Multinomial classifier on the test set, with an accuracy of about 96% compared to about 88%.
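
Accuracy condenses performance into a single number and does not show which class the errors fall in. As a minimal sketch, using hypothetical toy labels rather than the lab's predictions, scikit-learn's `confusion_matrix` breaks the same comparison down per class. On the lab data one would pass `y_test` together with `gauss_preds` or `multi_preds` instead.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy labels (0 = ham, 1 = spam), not taken from the lab dataset.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are the actual classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {acc}")
print(cm)
```

The off-diagonal entries count ham messages flagged as spam and spam messages that slipped through as ham, which accuracy alone cannot tell apart.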

In [118]:
# Sample predictions from test data.
print(f"Prediction: {gauss_model.predict(X_test[0:1, :])}")
print(f"Actual target: {y_test[0:1]}")

Prediction: [0]
Actual target: 529    0
Name: label_num, dtype: int64


From the input dataset, we know that *0* represents **Ham** and *1* represents **Spam**. Our trained model correctly predicted the message at index 529 as **Ham** (both the prediction and the actual target are *0*).

#### PART B

In [119]:
booking_data = pd.read_csv("../../practical_labs/AB_NYC_2019.csv")
booking_data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


Here, we check the columns `price`, `minimum_nights`, `number_of_reviews` and `availability_365` for outliers.

In [120]:
value_cols = ["price", "minimum_nights", "number_of_reviews", "availability_365"]

We can use different techniques to detect outliers. One of them is computing the z-score of each value, which denotes how many standard deviations the value lies from the mean of its column.

In [121]:
z_scores = np.abs(zscore(booking_data[value_cols]))
z_scores

Unnamed: 0,price,minimum_nights,number_of_reviews,availability_365
0,0.015493,0.293996,0.320414,1.916250
1,0.300974,0.293996,0.487665,1.840275
2,0.011329,0.196484,0.522433,1.916250
3,0.265335,0.293996,5.538156,0.617065
4,0.302811,0.144807,0.320414,0.856865
...,...,...,...,...
48890,0.344452,0.245240,0.522433,0.788486
48891,0.469373,0.147729,0.522433,0.583352
48892,0.157070,0.144807,0.522433,0.651730
48893,0.406912,0.293996,0.522433,0.841669


In [122]:
outlier_rows_zscore = np.where(z_scores > 3)[0]
outlier_rows_zscore

array([    3,     7,     9, ..., 48446, 48523, 48535])
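
To make the z-score rule concrete, here is a minimal sketch on a hypothetical toy array (the values are invented for illustration). It computes the z-scores by hand with the population standard deviation, which is also the default of `scipy.stats.zscore`, and shows that `np.where` on a 2-D array returns one row index per flagged cell, so a row that is extreme in several columns appears more than once.

```python
import numpy as np

# Hypothetical toy data: 20 rows, 2 columns; only the last row is extreme.
data = np.full((20, 2), 100.0)
data[19] = [1000.0, 5000.0]

# z = (x - mean) / std per column, with the population std (ddof=0).
z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))

# One (row, column) pair is returned for every cell above the threshold.
rows, cols = np.where(z > 3)

# Collapse repeated row indices to get each outlier row once.
unique_rows = np.unique(rows)

print(rows)
print(unique_rows)
```

Repeated indices are harmless when the result is only used to drop rows, but they inflate the count if the array length is read as the number of outlier rows.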

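Alternatively, we can use percentiles to determine the outliers. We take the 25th and 75th percentiles as Q1 and Q3, compute the interquartile range IQR = Q3 - Q1, and treat any value below the lower whisker (Q1 - 1.5 * IQR) or above the upper whisker (Q3 + 1.5 * IQR) as an outlier.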

In [123]:
Q1 = booking_data[value_cols].quantile(0.25)
Q3 = booking_data[value_cols].quantile(0.75)
IQR = Q3 - Q1

l_whisker = Q1 - 1.5 * IQR
u_whisker = Q3 + 1.5 * IQR
outlier_rows_percentile = booking_data[
    (
        (booking_data[value_cols] < l_whisker) | (booking_data[value_cols] > u_whisker)
    ).any(axis=1)
].index
outlier_rows_percentile

Index([    3,     5,     6,     7,     8,     9,    11,    12,    13,    14,
       ...
       48808, 48810, 48833, 48839, 48842, 48843, 48856, 48871, 48879, 48882],
      dtype='int64', length=14820)
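
The whisker arithmetic can be checked by hand on a small example. The sketch below uses a hypothetical toy Series, not the booking data, and relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Hypothetical toy column: eight ordinary values and one extreme value.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = s.quantile(0.25)  # 3.0
q3 = s.quantile(0.75)  # 7.0
iqr = q3 - q1          # 4.0

lower = q1 - 1.5 * iqr  # -3.0
upper = q3 + 1.5 * iqr  # 13.0

# Keep only the values that fall outside the whiskers.
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Because the quartiles ignore how far the extreme value lies from the rest, the thresholds stay at -3 and 13 no matter how large the last value becomes.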

After finding the outliers, we can either transform the outlier values so that they do not distort later predictions, or remove the rows that contain them altogether. Common transformations include replacing each outlier value with the mean, median or mode of its column.

Here, let's remove the rows that contain outliers.

In [124]:
booking_data_clean = booking_data.drop(outlier_rows_zscore)
booking_data_clean.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
5,5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,2019-06-22,0.59,1,129


The above table has been cleaned by removing the rows flagged as outliers by the z-score method.

Alternatively, we can keep all rows and replace only the outlier values with the mean of their column.

In [125]:
outlier_rows_indicator = z_scores > 3

booking_data_clean = booking_data.copy()  # Copy so the original DataFrame is not modified
booking_data_clean[value_cols] = np.where(
    outlier_rows_indicator, booking_data[value_cols].mean(), booking_data[value_cols]
)

booking_data_clean.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149.0,1.0,9.0,2018-10-19,0.21,6,365.0
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225.0,1.0,45.0,2019-05-21,0.38,2,355.0
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150.0,3.0,0.0,,,1,365.0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89.0,1.0,23.274466,2019-07-05,4.64,1,194.0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80.0,10.0,9.0,2018-11-19,0.1,1,0.0


In the above table, all rows are retained and every outlier value is replaced with the mean of its column (for example, `number_of_reviews` in row 3 is now 23.274466). The same steps can be followed to replace the values with the median or mode.
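
As a minimal sketch of the median variant, the example below uses a hypothetical toy DataFrame whose column names mirror the booking data; the values are invented. It works on a copy so the original frame stays intact.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row is extreme in both columns.
df = pd.DataFrame(
    {
        "price": [100.0] * 19 + [1000.0],
        "minimum_nights": [2.0] * 19 + [365.0],
    }
)
cols = ["price", "minimum_nights"]

# Absolute z-scores per column, using the population std (ddof=0).
z = np.abs((df[cols] - df[cols].mean()) / df[cols].std(ddof=0))
is_outlier = z > 3

# Replace outliers on a copy, one column at a time.
cleaned = df.copy()
for col in cols:
    cleaned.loc[is_outlier[col], col] = df[col].median()

print(cleaned.tail(2))
```

The median is less sensitive to the extreme values than the mean, since the mean used for replacement is itself pulled towards the outliers it is meant to correct.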