### Problem Statement
We have an advertising dataset from a marketing agency. The goal is to develop an ML model that predicts whether a particular user will click on an advertisement. The dataset has 10 columns:

'Daily Time Spent on Site', 
'Age', 
'Area Income',
'Daily Internet Usage', 
'Ad Topic Line',
'City',
'Male',
'Country',
'Timestamp',
'Clicked on Ad'.

**'Clicked on Ad'** is the binary target feature, which has two possible values: 0 (the user didn't click) and 1 (the user clicked).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno

%matplotlib inline

### Import data

In [None]:
df = pd.read_csv('../input/advertising/advertising.csv')
df.head()

In [None]:
print(df.shape)
print('features in the dataset:', df.columns.tolist())

So we have 1000 examples in total, each with 10 columns (9 features plus the target).

### Data Analysis

In [None]:
df.info()

In [None]:
missingno.matrix(df)

From the two cells above, we see there are no null values in the dataset, which is a good thing.

In [None]:
df.dtypes

In [None]:
# Let's look at stats of the non-object features
df.describe()

Let's go over each of the non-object features one by one:
1. **Daily Time Spent on Site:** Users spend between 32 and 91 minutes on the site, with a mean of 65 minutes, which is quite a large amount of time. This suggests that it is a popular site. We would like to see whether there is any correlation between time spent on the site and 'Clicked on Ad'.

2. **Age:** User age ranges from 19 to 61 years with a mean of 36 years, which tells us that the target users are adults.

3. **Area Income:** The minimum area income is around 13k and the maximum is 79k, which tells us that the users belong to different social classes. We would like to investigate further how income correlates with clicking on the ad.

4. **Daily Internet Usage:** Daily internet use ranges from 104 to 269 minutes. Out of that total, users spend quite a large share on the site (32 to 91 minutes). We will check whether the two are related in some way.

5. **Male:** 48% of the users are male. We will check whether gender affects the click rate on the ad.

6. **Clicked on Ad:** From the cell above and the cell below, we see that 50% of the ads were clicked and 50% were not. This tells us that the dataset is balanced, which tends to help training and makes accuracy a meaningful metric.

In [None]:
# Count how many ads were clicked and how many were not
fig = plt.figure(figsize = (5,1))
sns.countplot(y ='Clicked on Ad', data = df)
print(df['Clicked on Ad'].value_counts())
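
The balance described above can also be expressed as class shares rather than raw counts. The following is a minimal sketch on a made-up series (`clicked` and `balance_ratio` are hypothetical names, not part of the notebook); on the real data the same call would be applied to `df['Clicked on Ad']`.

```python
import pandas as pd

# Hypothetical stand-in for df['Clicked on Ad']; values are made up
clicked = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name='Clicked on Ad')

# normalize=True returns class shares instead of raw counts
shares = clicked.value_counts(normalize=True).sort_index()
print(shares)

# Ratio of the rarer class to the more common one; 1.0 means perfectly balanced
balance_ratio = shares.min() / shares.max()
print('balance ratio:', balance_ratio)
```

A ratio close to 1.0 supports treating the dataset as balanced; a much smaller value would suggest looking at metrics other than plain accuracy.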

In [None]:
sns.jointplot(x='Age',y='Daily Time Spent on Site', data=df)

This plot tells us that younger users, especially those aged 20 to 40, spend the most time on the site. So this group could be a good target for the ad campaign. We can also say that if a product targets a population whose age falls outside the range 19 to 61, this site is not the right platform to advertise it.

In [None]:
sns.scatterplot(x='Age',y='Daily Time Spent on Site', 
                hue='Clicked on Ad', data=df, palette='rocket')

This plot suggests that most users who spend less time on the site click on the ad. In contrast, users aged 20 to 55 who spend the most time on the site apparently don't click on the ad, whereas users in the same age group who spend less time do.

In [None]:
sns.scatterplot(x='Age',y='Daily Internet Usage', 
                hue='Clicked on Ad', data=df, palette='rocket')

Here we see that users under the age of 52 who spend more time on the internet do not click on the ad, whereas the remaining users, who usually spend less time on the internet, do click on it.

In [None]:
sns.scatterplot(x='Daily Internet Usage',y='Daily Time Spent on Site', 
                hue='Clicked on Ad', data=df, palette='rocket')

Again we see that users who spend less time on the internet are more likely to click on the ad, while users who spend more time on the internet and on the site are not. So the target users could be the ones who spend more time on the internet and also on the site.
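
The visual impression from these scatter plots can be backed by per-class averages. This is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical); on the real data the same `groupby` would be run on `df`.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration only
toy = pd.DataFrame({
    'Daily Time Spent on Site': [80.0, 75.0, 45.0, 50.0, 82.0, 40.0],
    'Daily Internet Usage': [240.0, 220.0, 130.0, 150.0, 250.0, 120.0],
    'Clicked on Ad': [0, 0, 1, 1, 0, 1],
})

# Mean of each usage feature within each class of the target
group_means = toy.groupby('Clicked on Ad')[
    ['Daily Time Spent on Site', 'Daily Internet Usage']
].mean()
print(group_means)
```

If the two rows of `group_means` differ clearly, the separation seen in the plot is reflected in the averages as well.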

In [None]:
sns.pairplot(df, hue='Clicked on Ad', vars=['Daily Time Spent on Site', 
                                           'Age', 'Area Income',
                                           'Daily Internet Usage'],
            palette='rocket')

The pairplot shows the pairwise relationships between the explanatory features, with points colored by the target feature.

We also see that users with higher area income who spend more time on the site do not click on the ad; relatively younger users with higher income do not click either. So this group of users could be the target users.

Again, users with higher area income, who are more likely to spend a long time on the site, do not click on the ad.



In [None]:
plots = ['Daily Time Spent on Site', 'Age', 
         'Area Income','Daily Internet Usage']
for i in plots:
    plt.figure(figsize=(12,6))
    
    plt.subplot(2,3,1)
    sns.boxplot(data=df,x = 'Clicked on Ad', y=i)
    
    plt.subplot(2,3,2)
    sns.boxplot(data=df,y=i)
    
    plt.subplot(2,3,3)
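    # distplot is deprecated in recent seaborn releases; histplot with a KDE
    # overlay and density scaling draws a comparable plot
    sns.histplot(df[i], bins=20, kde=True, stat='density')
    plt.title(i)
    plt.tight_layout()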
    plt.show()

In [None]:
fig = plt.figure(figsize=(12,10))
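# numeric_only=True skips the object columns, which recent pandas versions
# no longer drop silently when computing correlations
sns.heatmap(df.corr(numeric_only=True), annot=True)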

Here we see that Daily Time Spent on Site, Age, Area Income and Daily Internet Usage are highly correlated with the target variable, which indicates that they are important features and will be useful for the ML model.

We also notice that Daily Time Spent on Site is strongly correlated with Daily Internet Usage, Age and Area Income, and so is Daily Internet Usage.
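
To read the correlations with the target without scanning the whole heatmap, the target column of the correlation matrix can be sorted by absolute value. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical, and `numeric_only` assumes a reasonably recent pandas).

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration only
toy = pd.DataFrame({
    'Daily Time Spent on Site': [80.0, 75.0, 45.0, 50.0, 82.0, 40.0],
    'Age': [25, 30, 48, 52, 28, 45],
    'Clicked on Ad': [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every numeric column with the target,
# with the target's correlation with itself dropped
target_corr = (toy.corr(numeric_only=True)['Clicked on Ad']
               .drop('Clicked on Ad')
               .sort_values(key=abs, ascending=False))
print(target_corr)
```

The sign shows the direction of the relationship and the magnitude shows its strength; on the real data, `toy` would be replaced by `df`.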

#### Now let's take a look at the object features: 
Ad Topic Line, City, Country, Timestamp 

In [None]:
#get the info of the Ad Topic Line
print(df['Ad Topic Line'].value_counts())

In [None]:
object_features = ['Ad Topic Line', 'City', 'Country', 'Timestamp']
df[object_features].describe(include=['O'])  

From the cell above we see that all ad topic lines are unique, which indicates that this feature is unlikely to carry useful information for the prediction model. There are 969 different cities across 237 countries. This indicates that the users are not from a specific demographic but from all over the world. France appears most often, yet only 9 times, so even the most common country accounts for just 9 visitors.

Let's see if there are any other countries with the same number of users.


In [None]:
pd.crosstab(index=df['Country'], columns='count').sort_values(
    ['count'], ascending=False).head(20)

Since there are 237 countries in the dataset and no single country is dominant, it might be better to remove these features from the dataset.
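
The argument for dropping these columns rests on their cardinality, which can be measured directly. This is a minimal sketch on a made-up frame; the `toy` data and the `HIGH_CARDINALITY` cut-off are hypothetical choices for illustration, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the object columns
toy = pd.DataFrame({
    'Ad Topic Line': ['line a', 'line b', 'line c', 'line d'],
    'Country': ['France', 'Peru', 'France', 'Chile'],
})

# Share of distinct values per column; 1.0 means every row is unique
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)

# Hypothetical cut-off: flag columns where most values are distinct
HIGH_CARDINALITY = 0.5
to_drop = unique_ratio[unique_ratio > HIGH_CARDINALITY].index.tolist()
print('candidates to drop:', to_drop)
```

A column whose ratio is close to 1.0 behaves almost like a row identifier, so a model has little chance of learning a general pattern from it.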

In [None]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df['Month'] = df['Timestamp'].dt.month
df['Day'] = df['Timestamp'].dt.day
df['Weekday'] = df['Timestamp'].dt.dayofweek
df['Hour'] = df['Timestamp'].dt.hour
df = df.drop(['Timestamp'], axis=1)

df.head()

In [None]:
df['Month'][df['Clicked on Ad'] == 1].value_counts().sort_index()

In [None]:
df['Day'][df['Clicked on Ad'] == 1].value_counts().sort_index().plot()

In [None]:
df['Weekday'][df['Clicked on Ad'] == 1].value_counts().sort_index()

In [None]:
df['Hour'][df['Clicked on Ad'] == 1].value_counts().sort_index().plot()
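
The cells above count clicks per time bucket, and those counts also depend on how many rows fall into each bucket. A click rate per bucket removes that effect. This is a minimal sketch with made-up timestamps (the `toy` frame is hypothetical); on the real data it would be `df.groupby('Hour')['Clicked on Ad'].mean()`.

```python
import pandas as pd

# Hypothetical rows; timestamps and labels are invented for illustration
toy = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2016-03-27 00:53:11', '2016-04-04 00:10:02',
        '2016-03-13 20:35:42', '2016-01-10 20:03:00',
        '2016-06-03 20:47:07',
    ]),
    'Clicked on Ad': [0, 1, 1, 1, 0],
})
toy['Hour'] = toy['Timestamp'].dt.hour

# The mean of a 0/1 column is the fraction of ones, i.e. the click rate per hour
click_rate_by_hour = toy.groupby('Hour')['Clicked on Ad'].mean()
print(click_rate_by_hour)
```

The same pattern works for `Month`, `Day` and `Weekday` by changing the grouping column.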

### Logistic Regression


In [None]:
df = pd.read_csv('../input/advertising/advertising.csv')
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df['Month'] = df['Timestamp'].dt.month
df['Day'] = df['Timestamp'].dt.day
df['Weekday'] = df['Timestamp'].dt.dayofweek
df['Hour'] = df['Timestamp'].dt.hour
df = df.drop(['Timestamp'], axis=1)

df.head()
df.columns

### Missing Values:


In [None]:
df_missing = (df.isnull().sum()/len(df)).sort_values(ascending=False)
df_missing.head()

```
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder

train_features = ['Daily Time Spent on Site', 'Age', 'Area Income',
                   'Daily Internet Usage', 'Male', 
                   'Month', 'Day', 'Weekday', 'Hour']

numeric_features = ['Daily Time Spent on Site', 'Age', 'Area Income',
                   'Daily Internet Usage']

categorical_features = ['Male','Month', 'Day', 'Weekday', 'Hour']

scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

# `sparse_output` replaces the `sparse` argument in scikit-learn >= 1.2
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
df_ohe = ohe.fit_transform(df[categorical_features])
df_ohe = pd.DataFrame(df_ohe, columns = [f'ohe_{i}' for i in
                                        range(df_ohe.shape[1])])
df = pd.concat([df,df_ohe], axis=1)
df = df.drop(categorical_features, axis =1)

y = df['Clicked on Ad']
X = df.drop(['Ad Topic Line', 'City', 'Country','Clicked on Ad'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25,
                                                    random_state=101)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
```

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder

train_features = ['Daily Time Spent on Site', 'Age', 'Area Income',
                   'Daily Internet Usage', 'Male', 
                   'Month', 'Day', 'Weekday', 'Hour']

numeric_features = ['Daily Time Spent on Site', 'Age', 'Area Income',
                   'Daily Internet Usage']

#categorical_features = ['Male','Month', 'Day', 'Weekday', 'Hour']

scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

#ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
#df_ohe = ohe.fit_transform(df[categorical_features])
#df_ohe = pd.DataFrame(df_ohe, columns = [f'ohe_{i}' for i in range(df_ohe.shape[1])])
#df = pd.concat([df,df_ohe], axis=1)
#df = df.drop(categorical_features, axis =1)

X = df[train_features]
y = df['Clicked on Ad']
#X = df.drop(['Ad Topic Line', 'City', 'Country','Clicked on Ad'], axis=1)

x_train, x_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25,
                                                    random_state=101)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
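
In the cell above the scaler is fitted on the full dataset before the split, so the test rows influence the scaling statistics. One common alternative is to split first and wrap the scaler and the model in a pipeline. This is a minimal sketch on synthetic data from `make_classification`; the names `X_demo`, `y_demo` and `pipe` are hypothetical, and it is not the notebook's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four numeric advertising features (not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=101)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=101)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when it transforms the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='lbfgs'))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print('test accuracy:', test_accuracy)
```

With 1000 rows the difference is probably small, but the pipeline keeps the evaluation honest and carries over unchanged to cross-validation.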

In [None]:
x_train.head()

In [None]:
model1 = LogisticRegression(solver='lbfgs')
model1.fit(x_train, y_train)
predictions_LR = model1.predict(x_test)

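# scikit-learn metrics take (y_true, y_pred); with this order the rows of the
# confusion matrix are the true labels and the columns are the predictions
print('\nLogistic regression accuracy:', accuracy_score(y_test, predictions_LR))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, predictions_LR))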

In [None]:
print(classification_report(y_test, predictions_LR))

**Confusion Matrix:** 111 users were predicted to click on the ad and actually clicked, and 132 users were predicted not to click and actually did not.

5 users were predicted to click on the ad but did not, and 2 users were predicted not to click but actually did.

We have only a few mislabelled points, which is not bad given the size of the dataset.

**Classification Report:**

The report shows precision and recall of about 0.98. Precision is the share of predicted clicks that were real clicks, and recall is the share of real clicks that the model caught, so values this high mean the model makes few mistakes in either direction. This is not the same as saying that the probability of a user clicking on the ad is 0.98.
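
The link between the confusion-matrix counts and the reported metrics can be worked out by hand. This is a minimal sketch that uses only the four counts quoted above, treating a click as the positive class (an assumption); the per-class values it prints need not match the rounded or averaged figures in the classification report exactly.

```python
# Counts quoted in the confusion-matrix discussion above
tp = 111  # predicted click, actually clicked
tn = 132  # predicted no click, actually did not click
fp = 5    # predicted click, actually did not click
fn = 2    # predicted no click, actually clicked

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted clicks that were real clicks
recall = tp / (tp + fn)     # share of real clicks that the model caught

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')
```

The total of 250 matches the 25% test split of the 1000-row dataset.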

In [None]:
from sklearn.tree import DecisionTreeClassifier

model2 = DecisionTreeClassifier()
model2.fit(x_train, y_train)
predictions_DT = model2.predict(x_test)

# (y_true, y_pred) order: rows of the matrix are true labels
print('\nDecisionTreeClassifier accuracy:', accuracy_score(y_test, predictions_DT))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, predictions_DT))

In [None]:
print(classification_report(y_test, predictions_DT))

In [None]:
from xgboost import XGBClassifier

model3 = XGBClassifier()
model3.fit(x_train, y_train)
predictions_XGB = model3.predict(x_test)

# (y_true, y_pred) order: rows of the matrix are true labels
print('\nXGBClassifier accuracy:', accuracy_score(y_test, predictions_XGB))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, predictions_XGB))

In [None]:
print(classification_report(y_test, predictions_XGB))

### Feature Importances

In [None]:
import lightgbm as lgb
from sklearn.model_selection import KFold
feature_importances = np.zeros(X.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', 
                           boosting_type = 'goss', 
                           n_estimators = 10000, 
                           class_weight = 'balanced')

# Fit on two different train/validation splits and average the importances
# so that the ranking depends less on any single split
for i in range(2):
    
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(X,
                                                                        y,
                                                                        test_size = 0.25, 
                                                                        random_state = i)
    
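    # Train using early stopping; the callback API is needed in LightGBM >= 4.0,
    # where the early_stopping_rounds and verbose arguments of fit() were removed
    model.fit(train_features, train_y,
              eval_set = [(valid_features, valid_y)],
              eval_metric = 'auc',
              callbacks = [lgb.early_stopping(100), lgb.log_evaluation(200)])
    # Predict with the LightGBM model just fitted, not the logistic regression model1
    predictions_LGB = model.predict(valid_features)

    print('\nLGB accuracy:', accuracy_score(valid_y, predictions_LGB))
    print('\nConfusion Matrix:')
    print(confusion_matrix(valid_y, predictions_LGB))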
    
    # Record the feature importances
    feature_importances += model.feature_importances_

Our confusion matrix tells us that the total number of correct predictions is 158 + 141 = 299, while the number of incorrect predictions is 27 + 4 = 31. We can be satisfied with the prediction accuracy of our model.

It can be concluded that the Decision Tree model performed better than the Logistic Regression model. Its confusion matrix shows 308 correct predictions and only 22 incorrect ones. Additionally, the Decision Tree accuracy is about 3% higher than that of the logistic regression model.
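
A comparison based on a single split can shift with the random seed, so a cross-validated comparison is a useful complement. This is a minimal sketch on synthetic data from `make_classification`; `X_demo`, `y_demo`, `candidates` and `cv_means` are hypothetical names, and the scores it prints say nothing about the advertising dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the advertising dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=0)

candidates = {
    'logistic regression': LogisticRegression(solver='lbfgs', max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold accuracy per model; the spread shows how much the score
# moves from one fold to another
cv_means = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

If the gap between two models is smaller than the fold-to-fold spread, the ranking should be treated with caution.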

In [None]:
# Make sure to average feature importances! 
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(X.columns),
                                    'importance': feature_importances}
                                  ).sort_values('importance', ascending = False)

feature_importances.head(10)

In [None]:
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
feature_importances.tail()

In [None]:
def plot_feature_importances(df, threshold = 0.9):
    """
    Plots the 15 most important features and the cumulative importance of features.
    Prints the number of features needed to reach threshold cumulative importance.
    
    Parameters
    --------
    df : dataframe
        Dataframe of feature importances. Columns must be feature and importance
    threshold : float, default = 0.9
        Threshold for printing information about cumulative importances
        
    Return
    --------
    df : dataframe
        Dataframe ordered by feature importances with a normalized column (sums to 1)
        and a cumulative importance column
    
    """
    
    plt.rcParams['font.size'] = 18
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    df['cumulative_importance'] = np.cumsum(df['importance_normalized'])

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    # Cumulative importance plot
    plt.figure(figsize = (8, 6))
    plt.plot(list(range(len(df))), df['cumulative_importance'], 'r-')
    plt.xlabel('Number of Features'); plt.ylabel('Cumulative Importance'); 
    plt.title('Cumulative Feature Importance');
    plt.show();
    
    importance_index = np.min(np.where(df['cumulative_importance'] > threshold))
    print('%d features required for %0.2f of cumulative importance' % (importance_index + 1, threshold))
    
    return df

In [None]:
norm_feature_importances = plot_feature_importances(feature_importances,
                                                   threshold = 0.99)
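
The threshold step inside `plot_feature_importances` can be followed on a handful of numbers. This is a minimal sketch with made-up importances; it uses `np.argmax` on the boolean mask, which is an alternative formulation rather than the exact expression in the function above.

```python
import numpy as np

# Hypothetical raw importances, already sorted from most to least important
importances = np.array([50.0, 30.0, 15.0, 4.0, 1.0])

# Normalize so the importances sum to one, then accumulate
normalized = importances / importances.sum()
cumulative = np.cumsum(normalized)

threshold = 0.9
# Position of the first feature at which the running total exceeds the threshold
first_index = int(np.argmax(cumulative > threshold))
n_required = first_index + 1
print(n_required, 'features required for', threshold, 'of cumulative importance')
```

Note that `np.argmax` returns 0 when no entry exceeds the threshold, so this form is only safe for thresholds below 1.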