
#### Problem Statement

Social media advertising is now one of the most popular forms of advertising. Advertisers can use users' demographic information to target their ads accordingly. You are given a dataset with the following attributes:

|Field|Description|
|---:|:---|
|User ID|Unique ID|
|Gender|Male or Female|
|Age|Age of a person|
|EstimatedSalary|Salary of a person|
|Purchased|‘0’ or ‘1’. ‘0’ means not purchased and ‘1’ means purchased.|


**Source:** https://www.kaggle.com/rishabhsingh98/social-network-ads

**Citation:** Rishabh Singh. (2020). Social Network Ads.

Implement a kNN classifier to determine whether a user will purchase a particular product displayed in a social network ad.

---

### List of Activities

**Activity 1:** Import Modules and Read Data

**Activity 2:** Perform Train-Test Split

**Activity 3:** Determine the Optimal Value of $k$

**Activity 4:** Build kNN Classifier Model






---

#### Activity 1: Import Modules and Read Data

Import the necessary Python packages.

Read the data from a CSV file to create a Pandas DataFrame.

**Dataset-->**  social-network-ads.csv

Also, print the first five rows of the dataset. Check for null values and treat them accordingly.


In [1]:
# Import all the necessary packages
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
#Load Dataset
import pandas as pd
df = pd.read_csv('/content/sample_data/social-network-ads.csv')
# print(df)

# Print first five rows using head() function
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [2]:
# Check if there are any null values. If any column has null values, treat them accordingly.
df.isnull().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

**Q:** Are there any missing or null values in the dataset?

**A:** No.
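
This dataset has no missing values, so no treatment is needed here. For reference, the sketch below shows one way null values could be treated if they were present. The toy DataFrame and the choice of median imputation and row dropping are assumptions made for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that reuses the dataset's column names (not the real file).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Female"],
    "Age": [19, np.nan, 26, 27],
    "EstimatedSalary": [19000, 20000, np.nan, 57000],
})

# Count the missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# One possible treatment: fill numeric gaps with the column median ...
filled = toy.copy()
for col in ["Age", "EstimatedSalary"]:
    filled[col] = filled[col].fillna(filled[col].median())

# ... and drop rows whose categorical value is missing.
filled = filled.dropna(subset=["Gender"])
print(filled)
```

Whether to impute or drop depends on how much data is missing and why; the median is used here only because it is robust to outliers such as unusually high salaries.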

---

#### Activity 2: Perform Train-Test Split

In this dataset, `Purchased` is the target variable and all the remaining columns are feature variables.

Create two separate DataFrames, one containing the feature variables and the other containing the target variable. Also, drop the `User ID` column from the features DataFrame, since an identifier carries no predictive information.





In [3]:
# Split the dataset into dependent(target) and independent features
x=df.drop(['User ID','Purchased'],axis=1)
y=df['Purchased']
print(x)
print(y)

     Gender  Age  EstimatedSalary
0      Male   19            19000
1      Male   35            20000
2    Female   26            43000
3    Female   27            57000
4      Male   19            76000
..      ...  ...              ...
395  Female   46            41000
396    Male   51            23000
397  Female   50            20000
398    Male   36            33000
399  Female   49            36000

[400 rows x 3 columns]
0      0
1      0
2      0
3      0
4      0
      ..
395    1
396    1
397    1
398    0
399    1
Name: Purchased, Length: 400, dtype: int64


Print a summary of the features DataFrame to determine the data type of each feature variable.

In [4]:
# Use 'info()' function with the features DataFrame.
print(x.info())
print(y.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Gender           400 non-null    object
 1   Age              400 non-null    int64 
 2   EstimatedSalary  400 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 9.5+ KB
None
<class 'pandas.core.series.Series'>
RangeIndex: 400 entries, 0 to 399
Series name: Purchased
Non-Null Count  Dtype
--------------  -----
400 non-null    int64
dtypes: int64(1)
memory usage: 3.2 KB
None


Convert the categorical `Gender` feature into numerical columns by calling the `get_dummies()` function of the `pandas` module and passing the features DataFrame as input.




In [5]:
# Use 'get_dummies()' function to convert each categorical column in a DataFrame to numerical.
dummy = pd.get_dummies(x)
dummy.head(5)

Unnamed: 0,Age,EstimatedSalary,Gender_Female,Gender_Male
0,19,19000,0,1
1,35,20000,0,1
2,26,43000,1,0
3,27,57000,1,0
4,19,76000,0,1
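
As a small illustration of what `get_dummies()` does, the sketch below encodes a hypothetical three-row frame. Passing `dtype=int` is an assumption added here so that the indicator columns are 0/1 integers as in the output above; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame; only the object-typed column is encoded.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [19, 26, 35],
})

# Numeric columns pass through unchanged; 'Gender' becomes one column per category.
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(encoded)
```

Each category gets its own indicator column, so every row has exactly one `1` across `Gender_Female` and `Gender_Male`.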


Split the dataset into train set and test set such that the train set contains 70% of the instances and the remaining instances will become the test set.

In [6]:
# Split the DataFrame into the train and test sets.
# Perform train-test split using 'train_test_split' function.
# Use the encoded features ('dummy'): kNN cannot compute distances on the string-valued 'Gender' column.
x_train, x_test, y_train, y_test = train_test_split(dummy, y, train_size=0.7, random_state=100)
# Print the shape of the train and test sets.
print("x Train", x_train.shape)
print("x Test", x_test.shape)
print("y Train", y_train.shape)
print("y Test", y_test.shape)

x Train (280, 4)
x Test (120, 4)
y Train (280,)
y Test (120,)


After this activity, you must obtain train and test sets so that they can be used for training and testing the kNN Classifier.

#### Activity 4: Build kNN Classifier Model

Deploy the kNN Classifier model for the optimal value of $k$ using the steps given below:

1. Import the `KNeighborsClassifier` class from the `sklearn.neighbors` module (if not imported yet).

2. Create an object of `KNeighborsClassifier` and pass the optimal $k$ value, 5, to its constructor.

3. Call the `fit()` function using the classifier object and pass the train set as inputs to this function.

4. Perform prediction for train and test sets using the `predict()` function.

5. Also, determine the accuracy score of the train and test sets using the `score()` function.

In [7]:
# Train kNN Classifier model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
# Perform prediction using 'predict()' function.
y_train_pred = knn.predict(x_train)
y_pred = knn.predict(x_test)
print("Accuracy", accuracy_score(y_test, y_pred))
# Predict for a single sample user. The values follow the encoded column order:
# Age, EstimatedSalary, Gender_Female, Gender_Male (the sample is taken to be male).
sample = pd.DataFrame([[10, 13240, 0, 1]], columns=x_train.columns)
pred = knn.predict(sample)
print(pred)
# Call the 'score()' function to check the accuracy score of the train set and test set.
print("Train score", knn.score(x_train, y_train))
print("Test score", knn.score(x_test, y_test))

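
Step 2 above takes the optimal $k$ as given. One common way to arrive at such a value is to fit a classifier for several candidate values of $k$ and compare their accuracy. The sketch below does this on synthetic data from `make_classification`; the data, the candidate range, and the `_demo` variable names are all assumptions for illustration and are not derived from the social-network-ads dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=100
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=100
)

# Try odd values of k (odd values avoid tied votes in binary classification).
candidate_ks = list(range(1, 16, 2))
scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, model.predict(X_te))

# Keep the k with the highest held-out accuracy.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Choosing $k$ on the same test set that is later used for reporting makes the reported accuracy optimistic; a separate validation split or cross-validation is a common alternative.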

Print the classification report to get an in-depth overview of the classifier performance using the `classification_report()` function of `sklearn.metrics` module.

In [None]:
# Display the precision, recall, and f1-score values.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
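
To read the per-label f1-scores programmatically rather than from the printed table, `classification_report()` can return a dictionary. The sketch below uses six hand-made labels, which are hypothetical and unrelated to the model's actual predictions.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical true and predicted labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 0]

# output_dict=True returns nested dicts keyed by the label as a string.
report = classification_report(y_true, y_hat, output_dict=True)
f1_label_0 = report["0"]["f1-score"]
f1_label_1 = report["1"]["f1-score"]
print(f1_label_0, f1_label_1)

# f1_score with average=None gives the same per-label values as an array.
per_label_f1 = f1_score(y_true, y_hat, average=None)
print(per_label_f1)
```

In this toy case each label has two correct predictions, one false positive, and one false negative, so precision, recall, and f1 are all 2/3 for both labels.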

**Q:** Write down the f1-scores for both the target labels.

**A:**




---

