#### Dataset used: Social Network Ads (https://www.kaggle.com/datasets/rakeshrau/social-network-ads)

#### Packages used:
##### Package versions: pandas 2.0.2, numpy 1.25.0, matplotlib 3.7.1, missingno 0.5.2, seaborn 0.12.2
##### Python version: 3.11.4

## TASK 1:

#### Business understanding:
Suppose we are an e-commerce company that runs advertisements to target potential customers. We want to determine whether a user's gender, age and estimated salary are related to whether that user makes a purchase in response to our ads.

Our objective is to make our ads more effective by spending the budget where it matters and by understanding our target audience better. Using the existing data, we want to predict from the characteristics mentioned whether a customer will buy in response to our advertising or not.

#### Dataset and attribute explanation:
The dataset records the gender, age and estimated salary of each user, together with whether that user purchased a product because of a social media advertisement.

##### The dataset consists of five columns:
##### User ID           = ID of each user
##### Gender            = Gender of each user
##### Age               = Age of each user
##### EstimatedSalary   = Estimated salary of each user
##### Purchased         = Shows whether the user purchased because of the ad (1 = Purchased; 0 = Not purchased)

##### -> 'Gender', 'Age' and 'EstimatedSalary' represent the independent variables.
##### -> 'Purchased' represents the dependent variable.
##### -> 'User ID' is not required, as the column is only an identifier and carries no predictive information.
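As a rough sketch, the split into independent and dependent variables could look like this in pandas. The small frame below is made up for illustration (it is not taken from Social_Network_Ads.csv), and the names `toy`, `feature_columns`, `X` and `y` are my own choices, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical rows that only mimic the column layout of the dataset
toy = pd.DataFrame({
    "User ID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 35, 47],
    "EstimatedSalary": [19000, 57000, 88000],
    "Purchased": [0, 0, 1],
})

feature_columns = ["Gender", "Age", "EstimatedSalary"]  # independent variables
target_column = "Purchased"                              # dependent variable

X = toy[feature_columns]  # 'User ID' is left out on purpose
y = toy[target_column]

print(X.columns.tolist())
print(y.tolist())
```

Selecting the columns by name, instead of by position, keeps the selection readable and independent of the column order in the CSV file.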

##### Why I use the SVM algorithm:
A support vector machine (SVM) looks for the hyperplane that separates the classes in an n-dimensional feature space with the largest possible margin. Once this hyperplane is found, new data points are classified according to the side of the hyperplane on which they fall. The hyperplane therefore acts as the decision boundary.

For this dataset, I use an SVM to classify whether a person purchases or not based on gender, age and estimated salary, so that the purchase decision of new customers can be predicted from these attributes.
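To make the idea of a separating hyperplane more concrete, here is a minimal sketch with a linear SVM on a handful of hand-made points. The data (`X_toy`, `y_toy`, `new_points`) is hypothetical and not taken from the dataset, and the linear kernel is chosen here only because it exposes the hyperplane directly as a weight vector `w` and an offset `b`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: columns are [age, salary in thousands]
# Labels: 1 = purchased, 0 = not purchased
X_toy = np.array(
    [[20, 20], [25, 30], [30, 25], [45, 90], [50, 100], [55, 95]], dtype=float
)
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X_toy, y_toy)

# For a linear kernel the decision boundary is the hyperplane w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# New points are classified by the side of the hyperplane they fall on
new_points = np.array([[22, 28], [52, 98]], dtype=float)
scores = clf.decision_function(new_points)  # signed score relative to the hyperplane
predictions = clf.predict(new_points)

print("w =", w, "b =", b)
print("scores =", scores)
print("predictions =", predictions)
```

A negative score puts a point on the "not purchased" side and a positive score on the "purchased" side. With a non-linear kernel the same principle applies, but the hyperplane lives in a transformed feature space and `coef_` is no longer available.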


### 1. Step: Importing libraries

In [2]:
#Importing the libraries

import numpy as np #Linear algebra
import pandas as pd #Data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #For plotting
import missingno as msno #For missing values analysis
import seaborn as sns #For plotting
from matplotlib.colors import ListedColormap #For plotting
from sklearn.model_selection import train_test_split #For splitting data into train/test
from sklearn.preprocessing import StandardScaler #For feature scaling
from sklearn.svm import SVC #For SVM classification, imported from sklearn library
from sklearn.metrics import confusion_matrix #For confusion matrix
#from imblearn.over_sampling import SMOTE #For oversampling data
from sklearn.metrics import classification_report #For classification report
from sklearn.model_selection import GridSearchCV #For grid search cross validation

### 2. Step: Import dataset and first examination

In [3]:
df = pd.read_csv('Social_Network_Ads.csv') #Loading the dataset
# Dataset can be found at: https://www.kaggle.com/datasets/rakeshrau/social-network-ads

In [None]:
df.head() #Checking the first 5 rows of the dataset to get an idea of the dataset

In [None]:
df.tail() #Checking the last 5 rows of the dataset

In [None]:
#Checking the number of rows and columns
print("Number of rows is =", df.shape[0], "\nNumber of columns is =", df.shape[1])

In [None]:
df.info() #Checking the overall information of the dataset

In [None]:
df.dtypes #Checking the data types of the columns

##### -> All columns are numerical (integer), except 'Gender' (object)!

In [None]:
df.describe() #Checking the statistical summary of the dataset for outlier detection

##### -> No outliers visible in the summary statistics!
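Reading min, max and quartiles from `describe()` is a visual check. A more explicit check could use the common 1.5 x IQR rule, sketched below. The helper `iqr_outlier_counts` and the frame `toy` are hypothetical names of mine, the values are made up, and the 1.5 factor is a convention rather than something the dataset dictates.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each given column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        outside = (frame[col] < lower) | (frame[col] > upper)
        counts[col] = int(outside.sum())
    return counts

# Hypothetical values; the age of 120 is planted to show a flagged outlier
toy = pd.DataFrame({
    "Age": [19, 25, 31, 35, 40, 47, 120],
    "EstimatedSalary": [20000, 35000, 43000, 57000, 76000, 88000, 140000],
})

outlier_counts = iqr_outlier_counts(toy, ["Age", "EstimatedSalary"])
print(outlier_counts)
```

On the real data the call would be `iqr_outlier_counts(df, ['Age', 'EstimatedSalary'])`; a count of zero for both columns would support the conclusion drawn from `describe()`.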

In [None]:
#Checking for missing values

print(df.isnull().sum()) #Sum of missing values in each column

msno.matrix(df) #Missing values show up as white lines in the plot
plt.title('Missing Data Matrix') #Title of the plot
plt.show() #Displaying the plot

##### -> No missing values in the dataset!

In [None]:
#Checking for duplicate rows

df.duplicated().sum() #Sum of duplicate rows

##### -> No duplicates in the dataset!

All in all, the dataset consists of 400 rows and 5 columns. 'User ID', 'Age', 'EstimatedSalary' and 'Purchased' are numerical columns. Only 'Gender' is of type object and needs further attention later on (Step 5: Data preprocessing), because object-type values often cause problems when used in ML algorithms. The dataset has no missing values and no duplicates, so no modifications are needed at this stage: it is in good shape and nearly ready to be processed. Had there been many missing values or duplicates, cleaning would have been necessary.
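The individual checks of this step can also be bundled into one small helper, which makes it easy to rerun them after any later change to the data. This is only a sketch: `quality_report` is a hypothetical function of mine, and the frame `toy` contains made-up rows with one planted missing value and one planted duplicate.

```python
import pandas as pd

def quality_report(frame):
    """Collect shape, missing values, duplicates and non-numeric columns."""
    non_numeric = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_numeric_columns": non_numeric,
    }

# Hypothetical rows: one missing age, and the last row repeats the second one
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Female"],
    "Age": [19, 35, None, 35],
    "Purchased": [0, 1, 1, 1],
})

report = quality_report(toy)
print(report)
```

For the dataset used here, `quality_report(df)` should list only 'Gender' under `non_numeric_columns`, in line with the findings above.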

### 3. Step: Exploratory Data Analysis (EDA)

In [None]:
#Counting how many people purchased and how many didn't

print(df['Purchased'].value_counts()) #Count values in column 'Purchased' (printed, since it is not the last expression of the cell)

sns.set(style="darkgrid") #Set the plot style to darkgrid

sns.countplot(x='Purchased', data=df) #Plotting the count of Purchased
plt.title("Count of Purchased") #Title of the plot
plt.xlabel("Purchased") #X-axis label
plt.ylabel("Count") #Y-axis label
plt.show() #Displaying the plot
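Besides the absolute counts, the relative share of each class shows at a glance how balanced the target is. The sketch below uses a made-up label series (`purchased` is hypothetical and does not reflect the real distribution in the dataset).

```python
import pandas as pd

# Hypothetical labels, only for illustration (1 = Purchased, 0 = Not purchased)
purchased = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Purchased")

counts = purchased.value_counts()                # absolute count per class
shares = purchased.value_counts(normalize=True)  # relative share per class
imbalance_ratio = counts.max() / counts.min()    # majority count / minority count

print(counts)
print(shares)
print("Imbalance ratio:", round(imbalance_ratio, 2))
```

On the real data the same calls would be applied to `df['Purchased']`. A ratio close to 1 means the classes are roughly balanced; a larger ratio means the minority class is rarer and accuracy alone becomes less informative.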