# The spring-up and the prosperity of the digital economy

## Table of Contents:
1. [Background reading](#background-reading)
    
    
2. [Dataset](#dataset)
    
    
3. [Essence of Data](#essence-of-data)
    
    3.1 [Background](#background)
        
    3.2 [Google Ads and Ads Auction](#google-ads-and-ads-auction)
        
    3.3 [Gig Economy](#gig-economy)
        
    3.4 [Code Example](#code-example)
        
    
4. [AI in the fabric of society](#ai-in-the-fabric-of-society)

# The spring-up and the prosperity of the digital economy

## Background reading:

**Suggested Reading**

*Intro to Statistical Learning*: Chapter 9， Section 9.1, 9.2, 9.3, Support Vector Machine(Page 368-385)

**Optional Reading**

*Ethical Algorithm*: Chapter 3, Games People Play, Shopping With 300 Million Friends(Page 116-121)


## Dataset:
**2 Attributes:**

1. Session_id
2. DateTime
3. User_id
4. Product
5. Campaign_id
6. Webpage_id
7. Product_category_1
8. Product_category_2
9. User_group_id
10. gender
11. age_level
12. user_depth
13. city_development_index
14. var_1
15. is_click(The response we focused on. Whether the user click or not)

Datasets comes from: https://www.kaggle.com/datasets/arashnic/ctr-in-advertisement?resource=download

This tutorial aims at using SVM to classify whether the user click the advertisement or not. For the company like Google, it should determine what advertisement to recommend so that the click rate is higher, and how much it should charge from advertisement companies. Also, product company should determine how much benefits can be acquired with the advertisement.

## Essence of Data
### Background

At present, the digital economy has become a new form of economic and social development after the agricultural and industrial economy, as well as a typical representative of the new round of industrial revolution. Along with the development of cloud computing, big data, artificial intelligence and industrial Internet, a new round of information technology revolution is breaking out and the digital economy is rising.

 

Since the birth of the world's first general-purpose electronic computer at the University of Pennsylvania in 1946, it has opened the curtain of the digital era for mankind. However, as an economic form, the digital economy, in fact, has emerged as early as the development of the semiconductor industry. Today, the digital economy is ubiquitous in our lives, and innovative digital economy applications such as mobile payments, gig economy and advertisement auction have influenced every aspect of our daily lives.

 

Vigorous development of the digital economy has also become a global consensus. According to the World Internet Development Report 2018, the global digital economy reached US$12.9 trillion in 2017, with the United States and China ranking among the top two in the world. The new economy represented by the digital economy is now flourishing and has become a new engine to drive global economic growth.

**Interesting Article:** https://www.brookings.edu/research/the-fourth-industrial-revolution-and-digitization-will-transform-africa-into-a-global-powerhouse/


## Google Ads and Ads Auction

Advertisements are everywhere that we see, omnipresent presence makes them impressive and sometimes influence our expense plans. Unlike previous advertisement on television or newspaper, because of the prevalent usage of digital devices, people can see advertisements most of the time. 

One good example is **Google Ads**(formerly **Google AdWords**). When we use the search engine, some advertisements will be shown on the screen to attract users. To determine which advertisement to be shown to platform users, Google will determine by the bid (The maximum amount that company is willing to pay for a click), the quality of ads and the expected impact from the ad extensions and other ad formats. 

Thus, it is important also to determine which to display on the screen, determined by users' interests, for example. An example database of click rate is here for you to practise: https://www.kaggle.com/datasets/arashnic/ctr-in-advertisement

Example codes are also in the tutorial.


## Gig Economy

In such a busy and fast-pace era, new form of economy is burgeoning. According to the definition in dictionary, gig economy is "*an economic sector consisting of part-time, temporary and freelance jobs*" Gig economy is also ubiquitous like the daily platform we use, Uber, Uber Eats, Lyft, etc. 

Reading Reference: https://gadallon.substack.com/p/the-future-of-the-gig-economy-growth?r=zgog


## Code Example


In [1]:
# First, Import all packages
import numpy as np
import pandas as pd
import gensim
from gensim.models import KeyedVectors
from gensim import models
from sklearn.model_selection import train_test_split
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
from sklearn.svm import SVC

In [2]:
# The data is already separated into training and testing
dataset_train = pd.read_csv('../../../data/Ad_click_prediction_train.csv')
dataset_test = pd.read_csv('../../../data/Ad_Click_prediciton_test.csv')
dataset_train

Unnamed: 0,session_id,DateTime,user_id,product,campaign_id,webpage_id,product_category_1,product_category_2,user_group_id,gender,age_level,user_depth,city_development_index,var_1,is_click
0,140690,2017-07-02 00:00,858557,C,359520,13787,4,,10.0,Female,4.0,3.0,3.0,0,0
1,333291,2017-07-02 00:00,243253,C,105960,11085,5,,8.0,Female,2.0,2.0,,0,0
2,129781,2017-07-02 00:00,243253,C,359520,13787,4,,8.0,Female,2.0,2.0,,0,0
3,464848,2017-07-02 00:00,1097446,I,359520,13787,3,,3.0,Male,3.0,3.0,2.0,1,0
4,90569,2017-07-02 00:01,663656,C,405490,60305,3,,2.0,Male,2.0,3.0,2.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
463286,583588,2017-07-07 23:59,572718,H,118601,28529,5,82527.0,4.0,Male,4.0,3.0,2.0,0,0
463287,198389,2017-07-07 23:59,130461,I,118601,28529,4,82527.0,10.0,Female,4.0,3.0,2.0,1,0
463288,563423,2017-07-07 23:59,306241,D,118601,28529,4,82527.0,2.0,Male,2.0,3.0,,0,0
463289,595571,2017-07-07 23:59,306241,D,118601,28529,5,82527.0,2.0,Male,2.0,3.0,,0,0


In [3]:
# Checks to see how many NAs in train set
dataset_train.isna().sum()

session_id                     0
DateTime                       0
user_id                        0
product                        0
campaign_id                    0
webpage_id                     0
product_category_1             0
product_category_2        365854
user_group_id              18243
gender                     18243
age_level                  18243
user_depth                 18243
city_development_index    125129
var_1                          0
is_click                       0
dtype: int64

In [4]:
# Checks to see how many NAs in test set
dataset_test.isna().sum()

session_id                    0
DateTime                      0
user_id                       0
product                       0
campaign_id                   0
webpage_id                    0
product_category_1            0
product_category_2        76171
user_group_id              5684
gender                     5684
age_level                  5684
user_depth                 5684
city_development_index    34609
var_1                         0
dtype: int64

In [5]:
# Remove NA data
dataset_train_rm = dataset_train.dropna()
dataset_test_rm = dataset_test.dropna()
dataset_train_rm

Unnamed: 0,session_id,DateTime,user_id,product,campaign_id,webpage_id,product_category_1,product_category_2,user_group_id,gender,age_level,user_depth,city_development_index,var_1,is_click
17,2927,2017-07-02 00:03,295456,I,404347,53587,1,146115.0,9.0,Female,3.0,3.0,3.0,1,0
21,3803,2017-07-02 00:03,312475,I,404347,53587,1,146115.0,2.0,Male,2.0,3.0,4.0,1,0
42,2670,2017-07-02 00:05,649512,I,404347,53587,1,146115.0,2.0,Male,2.0,3.0,1.0,1,0
48,390567,2017-07-02 00:06,99306,H,105960,11085,5,270915.0,4.0,Male,4.0,3.0,2.0,0,0
49,381228,2017-07-02 00:06,99306,H,105960,11085,5,270915.0,4.0,Male,4.0,3.0,2.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
463279,579414,2017-07-07 23:59,563083,H,118601,28529,5,82527.0,3.0,Male,3.0,3.0,2.0,0,0
463280,547394,2017-07-07 23:59,1132443,G,118601,28529,5,82527.0,3.0,Male,3.0,3.0,4.0,0,0
463281,393785,2017-07-07 23:59,12050,I,118601,28529,4,82527.0,3.0,Male,3.0,3.0,3.0,0,0
463286,583588,2017-07-07 23:59,572718,H,118601,28529,5,82527.0,4.0,Male,4.0,3.0,2.0,0,0


In [6]:
# Separates x and y
x_train = dataset_train_rm.loc[:,'product':'var_1']
x_test = dataset_test_rm.loc[:,'product':'var_1']

# One hot encode the two categorical features
dummies = pd.get_dummies(x_train, columns=['product','gender'])
merged = pd.concat([x_train,dummies],axis='columns')

dummies_test = pd.get_dummies(x_test, columns=['product','gender'])
merged_test = pd.concat([x_test,dummies_test],axis='columns')

# Drops unnecessary columns
x_test = merged_test.drop(['product','gender'], axis='columns')
x_train = merged.drop(['product','gender'], axis='columns')

x_train = x_train.drop(['campaign_id', 'webpage_id', 'user_group_id'], axis='columns')
x_test = x_test.drop(['campaign_id', 'webpage_id', 'user_group_id'], axis='columns')

print(x_train)

# Creates y train and test
y_train = dataset_train_rm.loc[:,'is_click']

# This needs to be created to be able to check test set accuracy but the original data is missing the outcome variable.
#y_test = dataset_test_rm.loc[:,'is_click']

        product_category_1  product_category_2  age_level  user_depth  \
17                       1            146115.0        3.0         3.0   
21                       1            146115.0        2.0         3.0   
42                       1            146115.0        2.0         3.0   
48                       5            270915.0        4.0         3.0   
49                       5            270915.0        4.0         3.0   
...                    ...                 ...        ...         ...   
463279                   5             82527.0        3.0         3.0   
463280                   5             82527.0        3.0         3.0   
463281                   4             82527.0        3.0         3.0   
463286                   5             82527.0        4.0         3.0   
463287                   4             82527.0        4.0         3.0   

        city_development_index  var_1  product_category_1  product_category_2  \
17                         3.0      1     

In [7]:
# Start SVM classifier
svm_classifier = SVC()
svm_classifier.fit(x_train, y_train)

In [8]:
# Check the accuracy on training set
svm_classifier.score(x_train, y_train)

0.9368841178089733

In [9]:
# Predicts the response for the test data
y_pred = svm_classifier.predict(x_test)

In [10]:
# You would ordinarily run this next but can't with this incomplete data set that does not include y_test...

# Checks the accuracy on test set
from sklearn import metrics
#print("Test set accuracy:",metrics.accuracy_score(y_test, y_pred))

## Check to see if there is an equal breakdown of male and female data points in each data set (train vs test).
If this does not hold true, the analysis may not be generalizable.

In [11]:
#Get proportions of male and female observations in the training dataset
gender_percent_train = dataset_train.gender.value_counts() / len(dataset_train)
print(gender_percent_train)

Male      0.849259
Female    0.111364
Name: gender, dtype: float64


In [12]:
#Get proportions of male and female observations in the test dataset
gender_percent_test = dataset_test.gender.value_counts() / len(dataset_test)
print(gender_percent_test)

Male      0.842206
Female    0.113683
Name: gender, dtype: float64


In [13]:
#We know that dataset_train contains 463290 observations
dataset_test
#Now we have determined that dataset_test contains 128857 observations

Unnamed: 0,session_id,DateTime,user_id,product,campaign_id,webpage_id,product_category_1,product_category_2,user_group_id,gender,age_level,user_depth,city_development_index,var_1
0,411705,2017-07-08 00:00,732573,J,404347,53587,1,,5.0,Male,5.0,3.0,,0
1,208263,2017-07-08 00:00,172910,I,118601,28529,3,82527.0,,,,,,1
2,239450,2017-07-08 00:00,172910,I,118601,28529,4,82527.0,,,,,,1
3,547761,2017-07-08 00:00,557318,G,118601,28529,5,82527.0,1.0,Male,1.0,3.0,1.0,0
4,574275,2017-07-08 00:00,923896,H,118601,28529,5,82527.0,9.0,Female,3.0,1.0,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128853,215328,2017-07-09 21:29,252148,B,414149,45962,2,254132.0,2.0,Male,2.0,3.0,4.0,0
128854,282232,2017-07-09 21:29,47955,D,98970,6970,4,,1.0,Male,1.0,3.0,,0
128855,140499,2017-07-09 21:29,314236,C,359520,13787,4,,2.0,Male,2.0,3.0,,0
128856,531038,2017-07-09 21:29,988544,E,98970,6970,2,,2.0,Male,2.0,3.0,,0


In [14]:
#Run a two-proportion z test:
import math
p_hat = (0.849259*463290+0.842206*128857)/(463290+128857)
print('p_hat =', p_hat)

p_hat = 0.8477241979643569


In [15]:
z = ((0.849259-0.842206)-0)/math.sqrt(p_hat*(1-p_hat)*((1/463290)+(1/128857)))
print ('z =', z)

z = 6.232996032078319


With a z-score so high, the probability of there being a difference in gender breakdown across the two groups is nearly 0. Therefore, gender should not be a factor that impedes the ability for an algorithm trained on the training set to accurately predict the test set.

## AI in the fabrics of society

### Gig Economy

In such a busy and fast-pace era, new form of economy is burgeoning. According to the definition in dictionary, gig economy is "*an economic sector consisting of part-time, temporary and freelance jobs*" Gig economy is also ubiquitous like the daily platform we use, Uber, Uber Eats, Lyft, etc. 

Reading Reference: https://gadallon.substack.com/p/the-future-of-the-gig-economy-growth?r=zgog

Reading: *Ethical Algorithm*: Chapter 3, Games People Play, Shopping With 300 Million Friends(Page 116-121)