# 1. Problem Statement
This is a problem that companies often face when running marketing campaigns. The dataset is attached below. The task is to predict which customers are most likely to click on the ad. Suppose you are working for a marketing company and running an online marketing campaign on which you spend \$1000 per potential customer. For each customer that you target with your digital ad campaign and who clicks on the ad, assume that you get a net profit of \$100. Your task is to come up with a model that will maximize the profit of the company.
- How would you answer this problem?
- Are there any features more important than others?
- Can you train a model that will make adequate predictions?
- Do you need to perform feature engineering?
- Do you need to do any preprocessing (data cleansing, imputation, etc.)?
- Can you justify each process that you performed on the dataset?
- What would be your recommendations for the company?
- What would be the next step?

The dataset is fairly straightforward to understand, so if you are experienced, you might feel less challenged. If you are a beginner, feel free to start easy with a simple approach. Take the time to do exploratory data analysis to fully understand the dataset. Make sure that you spend some time understanding the relationship between the different features. Feel free to ask questions, and try it out on your own first as if it were a take-home assignment.
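
To make the objective concrete, here is a minimal sketch of a profit function. It assumes one possible reading of the statement above: a targeted customer who clicks yields a net +100, while a targeted customer who does not click loses the 1000 spent. The names `PROFIT_PER_CLICK`, `COST_PER_MISS` and `campaign_profit` are hypothetical and not part of the original task.

```python
import numpy as np

# Assumed payoffs (one reading of the problem statement, not confirmed by it):
# a targeted customer who clicks yields +100, one who does not click costs 1000.
PROFIT_PER_CLICK = 100
COST_PER_MISS = 1000

def campaign_profit(y_true, y_pred):
    """Profit from targeting every customer predicted to click (y_pred == 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = np.sum((y_pred == 1) & (y_true == 1))   # targeted and clicked
    false_pos = np.sum((y_pred == 1) & (y_true == 0))  # targeted, no click
    return int(PROFIT_PER_CLICK * true_pos - COST_PER_MISS * false_pos)

# Two correct targets and one wasted target: 2 * 100 - 1 * 1000
example_profit = campaign_profit([1, 1, 0, 0, 1], [1, 1, 1, 0, 0])
print(example_profit)  # prints -800
```

Under this reading, a wasted target costs ten times what a successful one earns, so precision would matter more than recall when choosing a model.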

# 2. Introduction


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


%matplotlib inline 


# 3. Data Exploration

We will first import the data and take a first look. We will also try to identify and address obvious issues with the data, such as missing values, duplicates, etc.

In [2]:
# Loading data
data = pd.read_csv('advertising_dsdj.csv')

In [3]:
data.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 0:53,0.0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 1:39,0.0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35,0.0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 2:31,0.0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 3:36,0.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018 entries, 0 to 1017
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1018 non-null   float64
 1   Age                       1018 non-null   int64  
 2   Area Income               1018 non-null   float64
 3   Daily Internet Usage      1018 non-null   float64
 4   Ad Topic Line             1018 non-null   object 
 5   City                      1018 non-null   object 
 6   Male                      1018 non-null   int64  
 7   Country                   1018 non-null   object 
 8   Timestamp                 1018 non-null   object 
 9   Clicked on Ad             1014 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 79.7+ KB


In [5]:
# Checking for missing values
data.isna().sum() 

Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               4
dtype: int64

There seem to be 4 missing values in the target variable so we will drop these observations.

In [None]:
data = data.dropna(axis = 0)
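
Note that `dropna(axis=0)` removes every row with a missing value in any column. Here only the target has gaps, so the result is the same, but a more explicit variant restricts the check to the target column with `subset`. A small sketch on a toy frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a gap in a feature and a gap in the target.
toy = pd.DataFrame({
    'Age': [35, np.nan, 26, 29],
    'Clicked on Ad': [0.0, 1.0, np.nan, 1.0],
})

drop_any = toy.dropna(axis=0)                       # removes rows 1 and 2
drop_target = toy.dropna(subset=['Clicked on Ad'])  # removes row 2 only

print(len(drop_any), len(drop_target))  # prints 2 3
```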

In [None]:
# Checking for duplicates
print('Number of duplicated records in dataset:', data.duplicated().sum())

We drop the duplicated records.

In [None]:
data = data.drop_duplicates()

# 4. Exploratory data analysis

## 4.1 Describing the features and data cleaning

First, let's inspect the target variable for class imbalance.

In [None]:
# inspect for class imbalance of the target variable
data['Clicked on Ad'].value_counts() / data.shape[0]

This is a very well-balanced classification problem. Therefore, we don't have to worry about class imbalance issues.

Let us separate numerical and categorical variables to take a quick look at the summary statistics.

In [None]:
def num_cat_columns(df):
    '''
    Separates numerical and categorical (type object) variables in given dataframe.
    Returns a 2 element list containing the list of numerical variable names as first 
    element and the list of categorical variable names as second element.
    This is a preliminary separation based on data type upon import and should be used
    with caution, as some int64 variables may be randomly generated IDs and not true 
    numerical variables and some categorical variables might be represented as numerical
    (e.g. classes).
    '''
    col_num = []
    col_cat = []
    for column in df.columns:
        if df[str(column)].dtypes != 'O':       
            col_num.append(column)
        else:
            col_cat.append(column)
    return [col_num, col_cat]

[col_num, col_cat] = num_cat_columns(data)

In [None]:
data[col_num].describe()

In [None]:
data[col_cat].describe()

We see that for most of the true numerical features (not the classes 'Male' and 'Clicked on Ad'), the mean and the median are relatively close, indicating low skewness. This suggests that we are not likely to have to transform the data based on skewness when engineering features. We can confirm this later after visualisations.

However, we see that there are some bizarre values for the age feature: the minimum is negative and the maximum corresponds to an unrealistic human age. Let's visualise the Age feature by sorting it and plotting it against its index to easily see the outliers.

In [None]:
sorted_age = sorted(data['Age'])
x = range(len(sorted_age))  # position of each observation after sorting
y = sorted_age

plt.figure(figsize=(10,8))
plt.scatter(x, y)
plt.axhline(y=0, linestyle='dotted', color='r')
plt.axhline(y=100, linestyle='dotted', color='r')
plt.grid()


Let's take a closer look at the observations we have for extreme age values.

In [None]:
data[(data['Age'] < 18) | (data['Age'] > 80) ]

We can drop these observations as they are either obviously wrong or very likely to be included in the dataset due to error.

In [None]:
# .copy() avoids a SettingWithCopyWarning when columns are assigned later
data = data[(data['Age'] >= 18) & (data['Age'] <= 80)].copy()

We will transform the Timestamp feature to datetime to be able to work with it.

In [None]:
data['Timestamp'] = pd.to_datetime(data['Timestamp'])

In [None]:
data.head()

Let's also confirm that the time spent on site is less than the daily internet usage for all observations.

In [None]:
data[(data['Daily Internet Usage'] - data['Daily Time Spent on Site'] < 0)]

Finally, we remove these observations as well.

In [None]:
data = data[~(data['Daily Internet Usage'] - data['Daily Time Spent on Site'] < 0)]

## 4.2 Data Visualisation

First, we will define a function to use for univariate data visualisation.

In [None]:
def univar_plot_num(df, col, target):
    '''
    Function to plot numerical features. Creates two subplots: a histogram and a 
    boxplot where the observations of the column are separated according to the
    target.
    '''
    df_clean = df.dropna()
    f, axes = plt.subplots(ncols = 2, figsize = (14, 6))
    plt.suptitle('Feature: ' + col)
    sns.histplot(df_clean[col], kde = True, ax = axes[0])
    sns.boxplot(y = col, x = target, data = df_clean, ax = axes[1])
    return

We will start with univariate plots: a histogram of each numerical variable and a boxplot where the data is split according to the value of the target variable.

In [None]:
cols_num_plot = col_num[:-1]  # all numerical columns except the target (last one)
cols_num_plot

In [None]:
for col in cols_num_plot:
    univar_plot_num(data, col, 'Clicked on Ad')

The "Daily Time Spent on Site" and the "Daily Internet Usage" appear to be slightly bimodal. This could suggest the existence of different groups of customers. 

In addition, the "Age" and "Area Income" columns show some skewness, which suggests trying a logarithmic transformation of the right-skewed "Age" feature when modeling our data.

We can see from the box plots that "Daily Time Spent on Site" and "Daily Internet Usage" might have significant predictive power based on the difference between the groups of people who clicked on the ad vs. did not click. The "Age" and "Area Income" also could potentially be useful.

Let's look into correlations between features.

In [None]:
def correlation_plot(df):
    # numeric_only=True skips object and datetime columns (required in pandas >= 2.0)
    correlation = df.corr(numeric_only=True)
    plt.figure(figsize=(10,8))
    sns.heatmap(correlation, 
          xticklabels=correlation.columns.values,
          yticklabels=correlation.columns.values)
    print(correlation)

In [None]:
correlation_plot(data)

With the exception of gender, we observe some significant correlation between all features and the target. However, there seems to be high correlation between some of the predictive features, which is a fact that we might have to revisit later.
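
To read the correlations with the target more easily than from a heatmap, the target's column of the correlation matrix can be sorted by absolute value. The following is a minimal sketch on synthetic data; the column names `strong`, `weak` and `target` are hypothetical stand-ins, and `numeric_only` assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=500)

# Synthetic frame: one feature tied to the target, one that is pure noise.
toy = pd.DataFrame({
    'strong': target + rng.normal(0, 0.3, size=500),
    'weak': rng.normal(0, 1, size=500),
    'target': target,
})

# Absolute correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr(numeric_only=True)['target']
    .drop('target')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```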

It is now time to inspect the categorical variables. We will call .describe() again to see how things look after data cleaning.

In [None]:
data[col_cat].describe()

We see that "Ad Topic Line" and "City" probably have no predictive power, but we can take a look at the country feature.

In [None]:
data['Country'].value_counts()

In [None]:
clicked_country = pd.crosstab(data['Country'], data['Clicked on Ad'], rownames=['Country'], colnames=['Clicked On Ad'])
clicked_country.sort_values(1, ascending = False).head(10)

It seems like there is a big spread across 237 countries, with a maximum of 9 clicks per country. Therefore, the distribution will not be interesting to look at.

Finally, let's do some pair plots:

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(data=data[col_num],
             hue="Clicked on Ad",
             dropna=True)

## 4.3 Feature selection and engineering
We will proceed with some feature selection and engineering, as well as some preprocessing in order to be able to use the data to train a classification algorithm.
### 4.3.1 Dropping features

To simplify our model, we will drop some features that do not seem to offer predictive value as described above or that would require special considerations, such as time analysis. 

In [None]:
df = data.drop(['City', 'Ad Topic Line', 'Timestamp'], axis=1)

### 4.3.2 Transformations for skewed features

Earlier, the "Age" feature was observed to be right-skewed. Therefore, we will perform a logarithmic transformation.

In [None]:
df['log_age'] = np.log(df['Age'])
df = df.drop('Age', axis = 1)
univar_plot_num(df, 'log_age', 'Clicked on Ad')

The skewness of the distribution is visibly reduced.
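
The visual impression can also be checked numerically with `Series.skew()`, where values near 0 indicate a roughly symmetric distribution. A minimal sketch, using a synthetic right-skewed sample as a stand-in for the real feature (the lognormal parameters are arbitrary assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed sample standing in for a feature such as age.
raw = pd.Series(rng.lognormal(mean=3.5, sigma=0.4, size=1000))
logged = np.log(raw)

skew_raw = raw.skew()     # sample skewness before the transformation
skew_log = logged.skew()  # sample skewness after the transformation
print(f'skew before: {skew_raw:.2f}, after: {skew_log:.2f}')
```

On the real data, the same two calls on `data['Age']` and `df['log_age']` would give the before and after values.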

### 4.3.3 Scaling of numerical variables
We will scale the numerical variables to achieve better results with logistic regression.

In [None]:
df.head()

In [None]:
categ_variables = ['Country']
numer_variables = ['Daily Time Spent on Site', 'Area Income', 'Daily Internet Usage', 'Male', 'log_age']
target = 'Clicked on Ad'

In [None]:
scaler = MinMaxScaler()
df[numer_variables] = scaler.fit_transform(df[numer_variables])
df.head()
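
One caveat: fitting the scaler on the full dataset lets the minimum and maximum of the future test rows influence the training features. A common alternative is to split first, fit the scaler on the training part only, and reuse it on the test part. A minimal sketch on synthetic data, with hypothetical variable names:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(50, 10, size=(200, 3))  # synthetic numerical features
y_toy = rng.integers(0, 2, size=200)       # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

toy_scaler = MinMaxScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # reuse them; values may fall outside [0, 1]

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```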

### 4.3.4 One Hot Encoding for Categorical Features

In [None]:
df_model = pd.get_dummies(df)
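
By default, `pd.get_dummies` encodes only the object-typed columns and leaves numerical columns untouched, adding one indicator column per category. With 237 countries, this means a wide and sparse feature matrix. A toy illustration with made-up rows:

```python
import pandas as pd

# Toy frame: one numerical column and one categorical column.
toy = pd.DataFrame({
    'Male': [0, 1, 0],
    'Country': ['Tunisia', 'Nauru', 'Tunisia'],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))  # prints ['Male', 'Country_Nauru', 'Country_Tunisia']
```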

# 5. Modelling
In this section, we will attempt a few solutions and then select the best model. We will use F1 score as the performance metric.
## 5.1 Data Preparation

In [None]:
# Obtaining training and test set 
# drop(columns=...) replaces the positional axis argument, which pandas 2.0 removed
X, y = df_model.drop(columns=target).values, df_model[target].values
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                  test_size=0.2,
                                                  random_state=476,
                                                  stratify=y)

## 5.2 Model selection
### 5.2.1 Logistic regression
We will start with a quick and dirty implementation of a logistic regression with default parameters.

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
print(classification_report(y_test,predictions))
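
The report above uses the default decision threshold of 0.5. Since F1 is the chosen metric, one possible refinement is to score several thresholds on the predicted probabilities and compare. This is a minimal sketch on synthetic data from `make_classification`, not on the advertising dataset; all variable names are hypothetical. In practice, the threshold should be chosen on a validation set rather than the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real features.
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

clf = LogisticRegression().fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

# F1 score for each candidate threshold between 0.1 and 0.9.
thresholds = np.linspace(0.1, 0.9, 9)
f1_by_threshold = {
    round(float(t), 1): f1_score(y_te, (proba >= t).astype(int), zero_division=0)
    for t in thresholds
}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
print(best_threshold, f1_by_threshold[best_threshold])
```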