## Sales Conversion Analysis
The aim of this study is to optimize social ad campaigns for the highest possible conversion rate
by identifying the features that drive conversion.

## Importing libraries and Reading dataset(s)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as mno
sns.set_style("darkgrid", {"grid.color": ".2", "grid.linestyle": ":"})

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
#reading data
data = pd.read_csv('../input/clicks-conversion-tracking/KAG_conversion_data.csv')

Rudimentary inspection of dataset 

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()



> There are no missing values, so we can go ahead with further inspection.



In [None]:
#statistical summary of quantitative variables
data[['Impressions', 'Clicks', 'Spent', 'Total_Conversion', 'Approved_Conversion']].describe()

## Exploratory Data Analysis

In [None]:
#Function for visualizing categorical variables, count plot
def count_plot(x, p = 'deep'):
    #pass the column by keyword; recent seaborn versions reject positional data arguments
    ax = sns.countplot(x = data[x], palette = p)
    ax.set_title('{} Composition'.format(x), fontsize = 15, pad = 5)
    ax.patch.set_edgecolor('black')
    ax.patch.set_linewidth(1.2)  
    for k in ax.patches:
        ax.annotate('{:.1f}%'.format(k.get_height()/data.shape[0]*100),(k.get_x()+0.25, k.get_height()))

#Function for visualizing categorical variables, bar plot
def bar_plot(x, y, p = 'deep'):
    #pass the columns by keyword; recent seaborn versions reject positional data arguments
    ax = sns.barplot(x = data[x], y = data[y], estimator = np.mean, palette = p, ci = None)
    ax.set_title('Average {}'.format(y), fontsize = 15, pad = 5)
    ax.patch.set_edgecolor('black')
    ax.patch.set_linewidth(1.2)

#Function for visualizing quantitative variables, box plot
def box_plot(x):
    ax = sns.boxplot(y = data[x], color = 'tab:green', showfliers = True, showmeans = True)
    ax.set_title('{}'.format(x), fontsize = 12, pad = 5)
    ax.patch.set_edgecolor('black')
    ax.patch.set_linewidth(1.2)

def box_plot2(x, y, p = 'deep'):
    ax = sns.boxplot(x = data[x], y = data[y], palette = p, showmeans = True)
    ax.set_title('Distribution of {}'.format(y), fontsize = 15, pad = 5)
    ax.patch.set_edgecolor('black')
    ax.patch.set_linewidth(1.2)

### Outlier Analysis

In [None]:
plt.figure(figsize = [14,6])
cols = ['Impressions', 'Clicks', 'Spent', 'Total_Conversion', 'Approved_Conversion']
for i in range(5):
    plt.subplot(1,5,i+1)
    box_plot(cols[i])
plt.tight_layout()



> Evidently, all these variables have a tail of outliers, some of which are suspiciously high. We can remove some of these outliers, as they can potentially affect the analysis, model and inferences.





In [None]:
data = data[~(data.Impressions > 1400000)]
data = data[~(data.Clicks > 250)]
data = data[~(data.Spent > 400)]
data = data[~(data.Total_Conversion > 25)]
data = data[~(data.Approved_Conversion > 8)]
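
The cutoffs in the cell above are hand-picked, so it can help to check how many rows each one removes before applying them. The following is a minimal sketch on an invented toy frame, not the notebook's data; `toy`, `cutoffs`, `rows_above` and `drop_above` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the ad data; the values are invented for illustration only.
toy = pd.DataFrame({
    'Impressions': [1000, 250000, 1500000, 80000],
    'Clicks': [1, 40, 300, 12],
    'Spent': [1.5, 60.0, 450.0, 20.0],
})

# Upper cutoffs kept in one place (same style as the filters above).
cutoffs = {'Impressions': 1400000, 'Clicks': 250, 'Spent': 400}

def rows_above(df, limits):
    """Count, per column, how many rows exceed the given upper cutoff."""
    return {col: int((df[col] > limit).sum()) for col, limit in limits.items()}

def drop_above(df, limits):
    """Keep only the rows that do not exceed any cutoff."""
    keep = pd.Series(True, index=df.index)
    for col, limit in limits.items():
        keep &= ~(df[col] > limit)
    return df[keep]

print(rows_above(toy, cutoffs))
print(drop_above(toy, cutoffs).shape)
```

Keeping the cutoffs in a dictionary makes the thresholds easy to review and change together, and the per-column counts show which variable drives most of the removals.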

In [None]:
data.shape

In [None]:
plt.figure(figsize = [14,6])
cols = ['Impressions', 'Clicks', 'Spent', 'Total_Conversion', 'Approved_Conversion']
for i in range(5):
    plt.subplot(1,5,i+1)
    box_plot(cols[i])
plt.tight_layout()



> Now that the distributions look relatively better, we can proceed with the analysis.



### Univariate, Bivariate and Multivariate Analysis

In [None]:
#converting ad_id, xyz_campaign_id, fb_campaign_id, interest to category datatype 
data[['ad_id', 'xyz_campaign_id', 'fb_campaign_id', 'interest']] = data[['ad_id', 'xyz_campaign_id', 'fb_campaign_id', 'interest']].astype('category')

In [None]:
#changing campaign codes to A,B,C (renaming the categories avoids an in-place replace on a column selection)
data['xyz_campaign_id'] = data['xyz_campaign_id'].cat.rename_categories({916: 'campaign_A', 936: 'campaign_B', 1178: 'campaign_C'})

In [None]:
#composition of the company's campaign ids
plt.figure(figsize = [5.5,4.5])
with plt.style.context('seaborn-deep'):
    explode = (0.05, 0.05, 0.1)
    data.xyz_campaign_id.value_counts().plot(kind = 'pie', explode = explode, radius = 1.8, autopct='%0.1f%%', 
            startangle = 120, textprops={'fontsize': 14}, wedgeprops={"linewidth":2,"edgecolor":"k"}, shadow = True)
    plt.title('Distribution of Campaign_id', fontsize = 16, y = 1.3)
    plt.show()



> Around 54% of the ads are from campaign_C, 42% from campaign_B and the remaining from campaign_A.



In [None]:
plt.figure(figsize = [14,6])
plt.subplot(1,2,1)
bar_plot('xyz_campaign_id', 'Clicks')
plt.subplot(1,2,2)
bar_plot('xyz_campaign_id', 'Approved_Conversion')



> More people have interacted with ads from campaign_C, and it also has the highest average approved conversion, i.e., most people bought products in campaign_C.



In [None]:
#Gender composition in data
data.gender.value_counts(normalize = True)*100

In [None]:
#Distribution of gender
plt.figure(figsize = [18,6])
plt.subplot(1,3,1)
count_plot('gender')
plt.subplot(1,3,2)
bar_plot('gender', 'Clicks')
plt.subplot(1,3,3)
bar_plot('gender', 'Approved_Conversion')



> Men make up a slightly larger share of the data than women. However, on average, women have made more clicks on ads than men.

> Even though women made more clicks, the average approved conversion is almost the same for both groups.



In [None]:
#Distribution of age-group
plt.figure(figsize = [20,6])
plt.subplot(1,3,1)
count_plot('age')
plt.subplot(1,3,2)
bar_plot('age', 'Clicks')
plt.subplot(1,3,3)
bar_plot('age', 'Approved_Conversion')



> According to the data, older people have clicked on more ads, but the average approved conversion follows exactly the opposite order.


> There can be many reasons for this. The product that is advertised might be attractive to the older population but of practical use only to relatively younger adults.



In [None]:
#Function for visualizing composition of categories of different qualitative variables
def hue_count(x, y):
    with plt.style.context('seaborn-muted'):
        ax = sns.countplot(x = data[x], hue = data[y])
        #plt.xticks(rotation = 45, ha = 'right')
        ax.set_title('Distribution plot of {}'.format(x), fontsize = 15)
        ax.patch.set_edgecolor('black')
        ax.patch.set_linewidth(1.2)  
        for k in ax.patches:
            ax.annotate('{:.1f}%'.format(k.get_height()/data[x].notnull().sum()*100),(k.get_x()+0.05, k.get_height()))
def hue_bar(x, y, z):
    with plt.style.context('seaborn-muted'):
        ax = sns.barplot(x = data[x], y = data[y], hue = data[z], ci = None)
        ax.patch.set_edgecolor('black')
        ax.patch.set_linewidth(1.2)  
        #plt.xticks(rotation = 45, ha = 'right')
        ax.set_title('Average {}'.format(y), fontsize = 15)

In [None]:
#age-group and gender
plt.figure(figsize = [18,6])
plt.subplot(1,2,1)
hue_count('age', 'gender')
plt.subplot(1,2,2)
hue_bar('age', 'Approved_Conversion', 'gender')



> People in age-groups 30-34 and 35-39 have a relatively high average approved conversion. These groups should be targeted, especially men in age-group 30-34 and women in age-group 35-39.



In [None]:
#xyz_campaign_id and gender
plt.figure(figsize = [18,6])
plt.subplot(1,2,1)
hue_count('xyz_campaign_id', 'gender')
plt.subplot(1,2,2)
hue_bar('xyz_campaign_id', 'Approved_Conversion', 'gender')



> campaign_C has proved to be successful in terms of reach, especially with the female population.



In [None]:
#Composition of Interest codes
plt.figure(figsize = [20,18])
plt.subplot(3,1,1)
count_plot('interest', 'hls')
plt.subplot(3,1,2)
bar_plot('interest', 'Clicks', 'hls')
plt.subplot(3,1,3)
bar_plot('interest', 'Approved_Conversion', 'hls')



> Clearly, people with interests in range [100, 114] have interacted the most with the ads and also have a relatively higher average approved conversion. People with these interests can be targeted more.



In [None]:
#pivot-table aggregating Approved Conversion using mean()
st = data.groupby(['interest','gender']).Approved_Conversion.mean().unstack()

In [None]:
st = st.apply(lambda x: x/sum(x), axis = 1)

In [None]:
#stacked bar chart of the row-wise proportions
plt.figure(figsize = [20,6])
st.plot(kind = 'bar', stacked = True, figsize = [20,8])



> Among the people with the interest ids recommended in the last analysis, more females can be targeted, as females account for a larger share of the average approved conversion for interests in range [100, 114].
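
The stacked chart above normalizes the average approved conversion per gender, which is not the same as the share of ads shown to each gender. The sketch below contrasts the two on an invented toy frame; `toy`, `ad_share` and `conv_share` are hypothetical names, and the numbers are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'interest': [100, 100, 100, 101, 101],
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'Approved_Conversion': [1, 0, 2, 1, 1],
})

# Share of ads per gender within each interest (each row sums to 1).
ad_share = pd.crosstab(toy['interest'], toy['gender'], normalize='index')

# Share of the mean approved conversion per gender within each interest,
# i.e. the quantity the stacked bar chart is built from.
mean_conv = toy.groupby(['interest', 'gender'])['Approved_Conversion'].mean().unstack()
conv_share = mean_conv.div(mean_conv.sum(axis=1), axis=0)

print(ad_share)
print(conv_share)
```

In this toy example, females account for two thirds of the ads for interest 100 but only a fifth of the normalized mean conversion, so the two views can point in different directions.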



In [None]:
plt.figure(figsize = [10,8])
cbar_kws={'orientation':'vertical', 'shrink':1,'extend':'max',
          'extendfrac':0.05, 'drawedges':True, 'pad':0.05, 'aspect':15}
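#restrict to numeric columns; corr() raises on string columns such as age and gender in recent pandas versions
sns.heatmap(data.select_dtypes(include = 'number').corr(), annot = True, cmap = 'Reds', linecolor = 'k', linewidth = 0.2, cbar_kws = cbar_kws)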
plt.xticks(rotation = 45, fontsize = 14)
plt.yticks(fontsize = 14)

In [None]:
#visualizing relationship between numerical variables using pairplot
cols = ['Impressions', 'Clicks', 'Spent', 'Total_Conversion', 'Approved_Conversion']
plt.figure(figsize = [8,8])
with plt.style.context('seaborn-whitegrid'):
    sns.pairplot(data, corner = True, plot_kws = {'alpha': 0.5})



> From the heatmap of the correlation matrix and the scatterplots, there appears to be a very strong correlation between the amount the company spent on an ad and both the number of clicks registered and the number of impressions.
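
To read the same relationship as numbers rather than colours, one option is to pull a single column out of the correlation matrix and sort it. This is a minimal sketch on invented values; `toy` and `corr` are hypothetical names, and the perfect correlation here is a property of the toy data, not of the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration only; 'gender' is a non-numeric column.
toy = pd.DataFrame({
    'Spent': [1.0, 2.0, 3.0, 4.0],
    'Clicks': [2, 4, 6, 8],
    'Approved_Conversion': [1, 0, 1, 0],
    'gender': ['M', 'F', 'M', 'F'],
})

# Pearson correlation over the numeric columns only.
corr = toy.select_dtypes(include='number').corr()

# How strongly each numeric variable moves with the amount spent.
print(corr['Spent'].sort_values(ascending=False))
```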



In [None]:
#visualizing wrt gender
plt.figure(figsize = [8,8])
with plt.style.context('seaborn-whitegrid'):
    sns.pairplot(data, hue = 'gender', corner = True, plot_kws = {'alpha': 0.7}, palette = 'Set1')

In [None]:
#visualizing wrt age-group
plt.figure(figsize = [8,8])
with plt.style.context('seaborn-whitegrid'):
    sns.pairplot(data, hue = 'age', corner = True, plot_kws = {'alpha': 0.7})

### More Multivariate Analysis

In [None]:
p1 = data.groupby(['age','gender','xyz_campaign_id']).Approved_Conversion.mean().unstack()
p1

In [None]:
#xyz_campaign_id vs age vs Approved_conversion
plt.figure(figsize = [10,8])
cbar_kws={'label':'Average Approved_Conversion',
          'orientation':'vertical', 'shrink':1,'extend':'max',
          'extendfrac':0.05, 'drawedges':True, 'pad':0.05, 'aspect':15}
sns.heatmap(p1, annot = True, cmap = 'Blues', center = 0.65, linecolor = 'k', linewidth = 0.2, cbar_kws = cbar_kws)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14, rotation = 0)



> campaign_C has been successful in reaching people, especially people in age-group 30-34 and females in 35-39.


## Before Moving forward, let's define some new features.

To get a better picture, we need to analyse the following:

**Click-through rate** is the number of clicks that your ad receives divided by the number of times your ad is shown, expressed as a percentage: 
*   **CTR** = Clicks / Impressions * 100

**Click conversion rate** is calculated as the number of conversions divided by clicks, expressed as a percentage.
*   **Conversion Rate** = Number of Conversions / Total number of Clicks * 100
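
As a quick worked example of the two definitions, the sketch below computes both rates from group totals on an invented toy frame. `toy`, `add_rates` and the numbers are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Toy per-ad data, invented for illustration only.
toy = pd.DataFrame({
    'gender': ['M', 'M', 'F', 'F'],
    'Impressions': [10000, 30000, 20000, 20000],
    'Clicks': [2, 6, 10, 0],
    'Approved_Conversion': [1, 1, 1, 0],
})

def add_rates(totals):
    """Add CTR and approved conversion rate (both in percent) to a table of totals."""
    out = totals.copy()
    out['CTR'] = out['Clicks'] / out['Impressions'] * 100
    out['Approved_conversion rate'] = out['Approved_Conversion'] / out['Clicks'] * 100
    return out

# Sum the counts per group first, then take the ratios of the sums.
totals = toy.groupby('gender')[['Impressions', 'Clicks', 'Approved_Conversion']].sum()
rates = add_rates(totals)
print(rates[['CTR', 'Approved_conversion rate']])
```

Taking ratios of summed counts, rather than averaging per-ad rates, avoids dividing by zero for individual ads that received no clicks (such as the last row of the toy frame).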



In [None]:
#gender-wise analysis
temp = data.groupby('gender').agg({'Impressions':np.sum, 'Clicks':np.sum, 'Total_Conversion':np.sum, 'Approved_Conversion':np.sum})
temp['CTR'] = temp.Clicks/temp.Impressions*100
temp['Total_conversion rate'] = temp.Total_Conversion/temp.Clicks*100
temp['Approved_conversion rate'] = temp.Approved_Conversion/temp.Clicks*100
temp[['CTR','Total_conversion rate','Approved_conversion rate']]



> Even though CTR for men is relatively lower, they have much higher conversion rates. So, clearly, women tend to click more on ads but it's men who enquire about the product later or buy it.



In [None]:
#age-group wise analysis
temp = data.groupby('age').agg({'Impressions':np.sum, 'Clicks':np.sum, 'Total_Conversion':np.sum, 'Approved_Conversion':np.sum})
temp['CTR'] = temp.Clicks/temp.Impressions*100
temp['Total_conversion rate'] = temp.Total_Conversion/temp.Clicks*100
temp['Approved_conversion rate'] = temp.Approved_Conversion/temp.Clicks*100
temp[['CTR','Total_conversion rate','Approved_conversion rate']]



> Interestingly, older people tend to interact more with ads, but people in age-groups 30-34 and 35-39 are the ones who are more likely to buy the products.



In [None]:
#gender and age-group, combined analysis
temp = data.groupby(['age','gender']).agg({'Impressions':np.sum, 'Clicks':np.sum, 'Total_Conversion':np.sum, 'Approved_Conversion':np.sum})
temp['CTR'] = temp.Clicks/temp.Impressions*100
temp['Total_conversion rate'] = temp.Total_Conversion/temp.Clicks*100
temp['Approved_conversion rate'] = temp.Approved_Conversion/temp.Clicks*100
temp[['CTR','Total_conversion rate','Approved_conversion rate']]



> Men tend to enquire and buy more than women in every age-group, including 30-34 and 35-39. Men in these two age-groups should be targeted.



In [None]:
#analysis on Company's campaigns
temp = data.groupby(['xyz_campaign_id']).agg({'Spent':np.sum, 'Impressions':np.sum, 'Clicks':np.sum, 'Total_Conversion':np.sum, 'Approved_Conversion':np.sum})
temp['CTR'] = temp.Clicks/temp.Impressions*100
temp['Total_conversion rate'] = temp.Total_Conversion/temp.Clicks*100
temp['Approved_conversion rate'] = temp.Approved_Conversion/temp.Clicks*100
temp['Impression-spent ratio'] = round(temp.Impressions/temp.Spent/1000, 2)  #this metric has been scaled down and rounded for better interpretation and readability.
temp['Click-spent ratio'] = temp.Clicks/temp.Spent
temp['Total_conversion-spent ratio'] = temp.Total_Conversion/temp.Spent
temp['Approved_conversion-spent ratio'] = temp.Approved_Conversion/temp.Spent
temp[['CTR','Total_conversion rate','Approved_conversion rate','Impression-spent ratio','Click-spent ratio',
      'Total_conversion-spent ratio','Approved_conversion-spent ratio']]



> Contrary to our previous analysis, where we found that campaign_C was the most successful ad campaign,


*   It was **campaign_A** which was most effective considering the amount of money spent on this campaign. It has significantly higher conversion rates, click-to-spent ratio and conversion-to-spent ratios, all of which are highly desirable.
*   Not surprisingly, campaign_C was able to reach more people because of the amount spent on ads in this campaign. campaign_A could have produced better results with the same quality and quantity of resources.
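
One way to make the spend comparison easier to read is to look at the reciprocal of the conversion-to-spent ratio, i.e. the money spent per approved conversion. This is a sketch on invented campaign totals; the campaign names, the figures and the `Cost per approved conversion` column are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical campaign totals, invented for illustration only.
totals = pd.DataFrame(
    {'Spent': [150.0, 3000.0], 'Approved_Conversion': [30, 300]},
    index=['campaign_small', 'campaign_large'],
)

# Approved conversions obtained per unit of money spent.
totals['Approved_conversion-spent ratio'] = totals['Approved_Conversion'] / totals['Spent']

# The reciprocal: money spent per approved conversion.
totals['Cost per approved conversion'] = totals['Spent'] / totals['Approved_Conversion']

print(totals)
```

In this toy example the larger campaign produces ten times as many conversions but pays twice as much for each one, which mirrors the kind of trade-off described in the bullets above.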









In [None]:
#Company's campaign, gender and age-group, combined analysis
temp = data.groupby(['xyz_campaign_id','gender','age']).agg({'Impressions':np.sum, 'Clicks':np.sum, 'Total_Conversion':np.sum, 'Approved_Conversion':np.sum})
temp['CTR'] = temp.Clicks/temp.Impressions*100
temp['Total_conversion rate'] = temp.Total_Conversion/temp.Clicks*100
temp['Approved_conversion rate'] = temp.Approved_Conversion/temp.Clicks*100

In [None]:
temp[['CTR']].unstack().plot(kind = 'bar', stacked = False, figsize = [12,6])
plt.xticks(rotation = 45, fontsize = 13)
plt.yticks(fontsize = 13)

In [None]:
temp[['Total_conversion rate']].unstack().plot(kind = 'bar', stacked = False, figsize = [12,6])
plt.xticks(rotation = 45, fontsize = 13)
plt.yticks(fontsize = 13)

In [None]:
temp[['Approved_conversion rate']].unstack().plot(kind = 'bar', stacked = False, figsize = [12,6])
plt.xticks(rotation = 45, fontsize = 13)
plt.yticks(fontsize = 13)



> Once again, we can observe here that females have a higher CTR in all three campaigns but men have higher conversion rates.

> Younger males specifically are more likely to convert than others.

> It is important to note that the sample size of females in age-group 40-44 is only 1, so the rates for that group are not reliable.
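
Small groups like this are easier to spot when the group size is carried alongside the rates. The sketch below shows one possible way to do that on invented data; `toy`, `summary`, `n_ads`, `reliable` and the threshold `MIN_ADS` are all hypothetical choices made for illustration.

```python
import pandas as pd

# Toy per-ad data, invented for illustration only.
toy = pd.DataFrame({
    'age': ['30-34', '30-34', '30-34', '40-44'],
    'Clicks': [4, 6, 10, 1],
    'Approved_Conversion': [1, 1, 2, 1],
})

# Aggregate totals and keep the number of ads behind each group.
summary = toy.groupby('age').agg(
    n_ads=('Clicks', 'size'),
    Clicks=('Clicks', 'sum'),
    Approved_Conversion=('Approved_Conversion', 'sum'),
)
summary['Approved_conversion rate'] = summary['Approved_Conversion'] / summary['Clicks'] * 100

# Hypothetical threshold: treat rates built on fewer ads than this as unreliable.
MIN_ADS = 3
summary['reliable'] = summary['n_ads'] >= MIN_ADS

print(summary)
```

Here the 40-44 group shows a 100% conversion rate, but the `reliable` flag makes it clear that the figure rests on a single ad.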







In [None]:
data[(data.xyz_campaign_id == 'campaign_A') & (data.gender == 'F')].age.value_counts()

In [None]:
#Interest code analysis
temp = data.groupby(['interest']).agg({'Impressions':np.sum, 'Clicks':np.sum, 'Total_Conversion':np.sum, 'Approved_Conversion':np.sum})
temp['CTR'] = temp.Clicks/temp.Impressions*100
temp['Total_conversion rate'] = temp.Total_Conversion/temp.Clicks*100
temp['Approved_conversion rate'] = temp.Approved_Conversion/temp.Clicks*100

In [None]:
temp[['CTR']].plot(kind = 'bar', stacked = False, figsize = [16,5])
plt.xticks(rotation = 45, fontsize = 13)
plt.yticks(fontsize = 13)

In [None]:
temp[['Approved_conversion rate']].plot(kind = 'bar', stacked = False, figsize = [16,5])
plt.xticks(rotation = 45, fontsize = 13)
plt.yticks(fontsize = 13)



> CTR is more or less consistent for all interest codes.


> From an earlier analysis, the average conversion count was higher for interest codes in range [100, 114], possibly because these were the most reached groups in previous campaigns. However, the conversion rate of people with interest codes {2,21,31,36,65,101,102} is the highest. People with these interests should be targeted in the next campaign.



## Final Summary and Recommendations

**Campaigns:**
1. Most ads are from campaign_C. Around 54% of the ads are from campaign_C, 42% from campaign_B, and the remaining from campaign_A.
2. More people have interacted with ads from campaign_C, and it also has the highest average approved conversion, i.e., most people bought products in campaign_C.
3. campaign_C has proved to be successful in terms of reach, especially with the female population.
4. According to CTR and Conversion rate analysis,
    * It was campaign_A which was most effective considering the amount of money spent on this campaign. It has significantly higher conversion rates, click-to-spent ratio, and conversion-to-spent ratios, all of which are highly desirable.
    * Not surprisingly, campaign_C was able to reach more people because of the amount spent on ads in this campaign. campaign_A could have produced better results with the same quality and quantity of resources.

**Gender:**
1. Men make up a slightly larger share of the data than women. However, on average, women have made more clicks on ads than men.
2. Even though the number of clicks made was more by women, the average approved conversions are almost the same for both groups.
3. Even though CTR for men is relatively lower, they have relatively much higher conversion rates. So, clearly, women tend to click more on ads but it's men who enquire about the product later or buy it.

**Age-group:**
1. According to the data, older people have clicked on more ads, but the average approved conversion has exactly the opposite order.
2. There can be many reasons for this. The product that is advertised might be attractive to the older population but of practical use only to relatively younger adults.
3. Interestingly, older people tend to interact more with ads but people from age-group 30-34 and 35-39 are the ones who are more likely to buy the products. These groups can be targeted in the next campaign.

**Money-spent:**
1. There is a strong positive correlation between money spent and impressions. This is expected: the more money is spent, the more often the ad is shown.
2. There is also a weak correlation between money spent and conversion. This indicates that the campaigns were not optimized enough to ensure a strong positive correlation.

**Interests:**
1. People with interests in the range [100, 114] have interacted the most with the ads and also have a relatively higher average approved conversion. People with these interests can be targeted more.
2. The average conversion count was higher for interest codes in the range [100, 114], possibly because these were the most reached groups in previous campaigns. However, the conversion rate of people with interest codes {2,21,31,36,65,101,102} is the highest. People with these interests should be targeted in the next campaign.

**Other insights and Recommendations:**
1. Men tend to enquire and buy more than women in every age-group, including 30-34 and 35-39. Men in these two age-groups should be targeted.
2. It is observed that females have a higher CTR in all three campaigns but men have higher conversion rates. Younger males specifically are more likely to convert than others. The next campaign should be more focused on men.
