**BEFORE YOU BEGIN,** please work from a copy of this notebook in your Google Drive *(File > Save a copy in Drive)*.

If you skip this step, any changes you make **will not be saved.**






---

# **AI and Marketing**
## How brands are using AI to better understand consumers

---

# 1: **Introduction**

This lab notebook, created for a [Technovation, Code with me - AI](https://www.technovationmontreal.com/artificial-intelligence/en) workshop, guides you through data exploration and the creation of a simple AI model. It uses sample data similar to the behavioral data captured on a real website or mobile application.

This is an introductory activity, so all of the code has already been written for you: all you need to do is click a button to run it! We've also included comments about what the code is doing so you can start to get the hang of what Python code looks like.

## What's Colab?

*Colaboratory*, or **Colab** for short, is a free Google product that lets anybody write and run Python code in the browser. Each .ipynb file is called a Colab "notebook" and can be stored in Google Drive just like Google Docs or Sheets, which you may already be familiar with.

If you haven't seen or interacted with this kind of document before, don't worry! Just keep following the tutorial in this section and you'll be good to go for the rest of the activity.

### Understanding Notebook Structure

The building blocks of every notebook are called *cells.* If you click once on this paragraph, you can see an outline of the cell it lives in. **Try clicking some of the paragraphs or headers from earlier in the notebook to see the separate cells.**

There are two types of cells: *text* and *code*, which are both editable. *Text cells* contain formatted text, as you've seen, and *code cells* contain executable Python code.

Let us demonstrate...

***This is a text cell.***

In [120]:
# And this is a code cell.

Next, you'll learn how to create, edit, and move around cells yourself!

### *Working with Cells*

**To CREATE a new cell, hover your mouse over the bottom edge of this cell.** You should see one button for a new code cell, and another for a new text cell.


**First, create a new text cell below and write a random sentence or two.** Maybe write about what you ate yesterday, or the video you last watched on YouTube.

You may notice something a little funny about the way the text you input looks vs. the text that's actually displayed on the cell. Each text cell is formatted using a syntax that's called Markdown. All it basically does is help mark formatted text (like **bolded** or *italicized* text) in a way that a computer can understand.

But don't worry about memorizing the syntax, because you can just use the icons that show up on the top of the cell when you start typing. **Feel free to play around with the icons to understand how they change your text.**

***To stop editing, simply click on a different cell.***

---

**To EDIT an existing cell, double click it.** Try it on the cell you just created.

**To MOVE an existing cell, first select it, then use the ↑ or ↓ arrows on the upper right of the cell to change its position.** Try moving your cell below this one.

**To DELETE an existing cell, first select it, then click the trash can icon at the upper right of the cell.** Try it on the cell you created.

You may be curious about the other symbols in the top right corner. Feel free to explore what they do.

*If you ever want to undo a cell deletion, you can use the Edit dropdown menu above.*

Hopefully this section helped you become familiar with working with a notebook document. The last thing you need to learn in order to complete this activity is how to run the code, which we'll dive into next.

### *Running Code Cells*

In this activity, we have already written out all the code for you. While you don't need to write any code, you will have to run each cell yourself.

**To RUN a code cell, click the [ ▶️ ] icon that appears on the left of the cell when you hover over it with your mouse.**

Give it a try below:

In [121]:
# <--- hover your mouse here!

# Run this block to print a message!
# (Side note: any lines in green like this one are just comments we can add within a code cell to explain what we're doing)

print("Hello world!")
print("**** Welcome to the 'Code with me - AI' workshop ****")

Hello world!
**** Welcome to the 'Code with me - AI' workshop ****


**If the code has any output, it'll show up underneath the block, where** `Hello world!` **is now.**

You may also notice that a number appeared in between the brackets `[ ]`.

This number just confirms that a cell has finished running and also tells you the order in which code cells were run. `[1]` means this cell was the first one run, and `[2]` was the second cell run, etc.

For this activity, you also don't need to figure out exactly how the code works. We'll explain what each block is doing at each step, so all you need to do is run them!

Let's try to get some information using Python code. <br/>**Run the cell below to get the current day and time.**

In [122]:
import datetime
import pytz

# Specify the Montreal timezone
montreal_tz = pytz.timezone('America/Montreal')

# Get the current datetime in Montreal
current_datetime = datetime.datetime.now(montreal_tz)

# Extract the day and format the time nicely
current_day = current_datetime.strftime("%A")  # Example: "Friday"
current_time = current_datetime.strftime("%H:%M:%S")  # Example: "11:25:45"

print("Current day in Montreal:", current_day)
print("Current time in Montreal:", current_time)

Current day in Montreal: Friday
Current time in Montreal: 17:05:48


Now that you're a bit more familiar with how a Colab notebook works, let's learn about the dataset we'll be exploring.


## About The Dataset

Mobile applications and websites usually capture user activity to provide insights and metrics on digital traffic.
<br/>
The dataset used in this notebook emulates an e-commerce website. It does not include Personally Identifiable Information (PII), only browsing data (online interactions).






Now, let's run some code to take a closer look at how our information is structured.
<br/>

---

# 2: **Import data**

First, we'll import a few libraries to let the notebook know what specific tools we'll need to work with our data.

**You *must* run the cell below before continuing the activity!** (Otherwise, the computer won't understand which specific functions we want to use from each library.)

In [123]:
# Import libraries
import io
import pandas as pd
import plotly.express as px

print("Libraries successfully imported!")

Libraries successfully imported!


Next, we'll load our dataset into the notebook. The file is stored in the cloud, on GitHub. <br/> **Run the cell below, and verify that you get a message that starts with `Dataset successfully loaded`.**

In [124]:
# The URL to identify the file
file_url = 'https://raw.githubusercontent.com/cloudvitamin/Technovation/main/visitor_data_training_micro.csv'

# The file is loaded into a DataFrame object (df).
# A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet.
df = pd.read_csv(file_url)

# Get number of rows and columns
rows = len(df.index)
# columns = len(df.columns)
print(f"Dataset successfully loaded ({rows} rows)")

Dataset successfully loaded (50000 rows)
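To make the idea of a DataFrame more concrete, here is a minimal sketch that builds a tiny one by hand. The three rows and their values are made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# A made-up, three-visitor table (illustrative values, not real data)
toy_df = pd.DataFrame({
    "UserID": ["visitor-a", "visitor-b", "visitor-c"],
    "promo_banner_click": [0, 1, 0],
    "ordered": [0, 1, 0],
})

# Each dictionary key became a column, and each list position became a row
print(toy_df)
print(len(toy_df.index), "rows")
```

The real dataset works the same way, just with many more rows and columns.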


Now that we've loaded the dataset, let's look into how it's structured.

**Run the next cell to see how many rows and columns are in the dataset.**

The output will be formatted as **`(number of rows, number of columns)`**.

In [125]:
# See the shape/structure of the dataset: Python code to get number of rows, followed by the number of columns
df.shape

(50000, 25)
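If you want to use the two numbers separately, one option is to unpack the shape into two variables. The sketch below uses a small made-up table (the name `toy_df` and its values are assumptions for illustration), so it runs without the real dataset.

```python
import pandas as pd

# Made-up table with 2 rows and 3 columns
toy_df = pd.DataFrame({
    "device_mobile": [1, 0],
    "device_computer": [0, 1],
    "device_tablet": [0, 0],
})

# shape is a tuple (rows, columns), so it can be unpacked into two variables
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows and {n_cols} columns")
```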

Now we want to know more about the different columns in our dataset.
<br/> **Run the next cell to list the different columns and their type (object, float, integer, etc.).**

In [126]:
# See the columns and their type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   UserID                   50000 non-null  object
 1   basket_icon_click        50000 non-null  int64 
 2   basket_add_list          50000 non-null  int64 
 3   basket_add_detail        50000 non-null  int64 
 4   sort_by                  50000 non-null  int64 
 5   image_picker             50000 non-null  int64 
 6   account_page_click       50000 non-null  int64 
 7   promo_banner_click       50000 non-null  int64 
 8   detail_wishlist_add      50000 non-null  int64 
 9   list_size_dropdown       50000 non-null  int64 
 10  closed_minibasket_click  50000 non-null  int64 
 11  checked_delivery_detail  50000 non-null  int64 
 12  checked_returns_detail   50000 non-null  int64 
 13  sign_in                  50000 non-null  int64 
 14  saw_checkout             50000 non-nul

OK, but what if we want to see the data itself? **We can take a peek at the first few rows of the dataset by running the following cell.**
<br/> You will see that the dataset is full of 0s and 1s that indicate whether each visitor performed a specific action (it shows visitors' online behavior).

In [127]:
# Show top 5 rows in dataframe
df.head(5)

Unnamed: 0,UserID,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,...,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,location,ordered
0,a720-6b732349-a720-4862-bd21-644732,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,a0c0-6b73247c-a0c0-4bd9-8baa-797356,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
2,86a8-6b735c67-86a8-407b-ba24-333055,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,0
3,6a3d-6b736346-6a3d-4085-934b-396834,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,0
4,b74a-6b737717-b74a-45c3-8c6a-421140,0,1,0,1,0,0,0,0,1,...,0,0,0,1,0,0,1,0,1,1
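Because each action column holds only 0 or 1, adding up a column counts the visitors who performed that action. Here is a small sketch of that idea on made-up values (the table `toy_df` is an assumption for illustration, not a slice of the real data).

```python
import pandas as pd

# Made-up behavior of four visitors (1 = did the action, 0 = did not)
toy_df = pd.DataFrame({
    "basket_add_list": [0, 1, 1, 0],
    "sign_in": [0, 0, 1, 0],
    "saw_homepage": [1, 1, 1, 0],
})

# Summing a 0/1 column counts the 1s; sorting puts the most common action first
action_counts = toy_df.sum().sort_values(ascending=False)
print(action_counts)
```

Running the same two calls on the action columns of `df` should rank the actions of the real visitors in the same way.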


# 3: **Kickstart your curiosity**

Since we now have access to the data, we can start asking some questions.

**Maybe we want to know who is a new visitor and who is a returning visitor (someone who has already visited in the past).**

In [128]:
# Filter rows based on the returning_user column
returning_visitor = len(df[df['returning_user']==1])
new_visitor = len(df[df['returning_user']==0])

print(f"There are {returning_visitor} returning visitors in the source dataset")
print(f"There are {new_visitor} new visitors in the source dataset")


There are 26835 returning visitors in the source dataset
There are 23165 new visitors in the source dataset
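Another way to get both counts at once is pandas' `value_counts()`, which tallies how often each value appears in a column. This is only a sketch on five made-up visitors; the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up column: 1 = returning visitor, 0 = new visitor
toy_df = pd.DataFrame({"returning_user": [1, 0, 1, 1, 0]})

# value_counts() returns one count per distinct value
visitor_counts = toy_df["returning_user"].value_counts()
print("Returning visitors:", visitor_counts.loc[1])
print("New visitors:", visitor_counts.loc[0])
```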


Or maybe we want to get the percentage of people who clicked on a **promotion banner**.

In [129]:
# The ratio of people who click on the banner out of all the visitors
click_promo_percentage = len(df[df['promo_banner_click']==1]) / len(df) * 100
print(f"The percentage of people who clicked on a promotion banner is: {click_promo_percentage}%")



The percentage of people who clicked on a promotion banner is: 1.69%
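For a column that contains only 0s and 1s, the average of the column is the share of 1s, so the same percentage can also be computed with `.mean()`. The sketch below assumes eight made-up visitors and uses the `:.2f` format to control how many decimals are printed.

```python
import pandas as pd

# Made-up column: 1 of 8 visitors clicked the banner
toy_df = pd.DataFrame({"promo_banner_click": [0, 0, 1, 0, 0, 0, 0, 0]})

# The mean of a 0/1 column is the fraction of rows equal to 1
click_share = toy_df["promo_banner_click"].mean() * 100

# :.2f prints the number with exactly two decimals
print(f"Share of visitors who clicked: {click_share:.2f}%")
```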


There are so, so, *so* many other questions you could ask about this data.

**Take a moment to list some other questions you might have about this dataset. What do you want to know?**

*Enter your response by editing the cell below.*

<< ***STUDENT RESPONSE*** >>

[type answer here]

---

Many of the best insights we can gather from data involve looking at trends and patterns, things that we humans are quite good at identifying. But it's a little challenging to spot interesting patterns just by looking at a table of words and numbers.

That's where *data visualization* comes in. In the next sections, we'll explore 3 different ways of using code to manipulate our dataset and create charts that can help reveal intriguing insights.

# 4: **Visualize data**

In this section, we'll look at the devices visitors use.

We create a graph showing the number of visitors in each device category.

In [130]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Keep only the three device columns (each holds 0 or 1)
df_graph = df[['device_mobile', 'device_computer', 'device_tablet']]
fig = make_subplots(rows=1, cols=3, subplot_titles=('Mobile', 'Computer', 'Tablet'))
total_rows = len(df)

# One bar chart per device: visitors who used it (True) vs. those who did not (False)
for k, name in enumerate(df_graph.columns):
    n_true = df_graph[name].sum()
    fig.add_trace(go.Bar(x=['True', 'False'], y=[n_true, total_rows - n_true], name=name), row=1, col=k + 1)

fig.update_layout(barmode='relative', bargap=0.05)

fig.show()

### Let's Get Analytical

**Given the graph above, answer the following questions in the cell below:**

(You can hover over each bar for additional information.)

1. How many different device categories are included in the dataset?
2. How many people are browsing from a computer?
3. What percentage of visitors are using a tablet?
4. (Bonus question) Are you able to pinpoint an error in the dataset regarding devices?

<< ***STUDENT RESPONSE*** >>

1.
2.
3.

In [131]:
fig = px.histogram(df, x="ordered", title="Non-buyers vs Buyers")
fig.update_layout(barmode='relative',  bargap=0.05, width=700, height=400)
fig.show()

In [132]:
# Overall, what is the approximate percentage of buyers in our dataset?

# Filter rows based on the 'ordered' column
df_buyers = df[df['ordered']==1]

# Get number of rows in the new dataframe
nb_buyers = len(df_buyers)
nb_visitors = len(df)

buyer_percentage = (nb_buyers/nb_visitors)*100
print(f"The percentage of buyers in our dataset is: {buyer_percentage}%")

The percentage of buyers in our dataset is: 4.198%


# 5: **Create an AI model to predict buyers**

Our dataset contains historical information about web app visitors. It includes a variable that identifies whether a visitor became a buyer (the `'ordered'` column).

<br/> In this section, we will create a model that predicts whether a visitor is likely to become a buyer, using only the browsing information available in the dataset.
We will train the model to tell buyers from non-buyers using historical data in which the outcome (the `'ordered'` column) is already known. This is called **supervised learning**.

In [133]:
# Preprocessing
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   UserID                   50000 non-null  object
 1   basket_icon_click        50000 non-null  int64 
 2   basket_add_list          50000 non-null  int64 
 3   basket_add_detail        50000 non-null  int64 
 4   sort_by                  50000 non-null  int64 
 5   image_picker             50000 non-null  int64 
 6   account_page_click       50000 non-null  int64 
 7   promo_banner_click       50000 non-null  int64 
 8   detail_wishlist_add      50000 non-null  int64 
 9   list_size_dropdown       50000 non-null  int64 
 10  closed_minibasket_click  50000 non-null  int64 
 11  checked_delivery_detail  50000 non-null  int64 
 12  checked_returns_detail   50000 non-null  int64 
 13  sign_in                  50000 non-null  int64 
 14  saw_checkout             50000 non-nul

First, we separate the features (the inputs) from the target (the value the model will predict).

In [134]:
# To train the model, we remove 'ordered' as it is the variable the model will predict.
# We also remove 'UserID' as we don't need user identifiers to train this model.
X = df.drop(['ordered','UserID'], axis=1)  # Features used to train the model

# Target variable: the variable the model will predict
y = df['ordered']



In [135]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   basket_icon_click        50000 non-null  int64
 1   basket_add_list          50000 non-null  int64
 2   basket_add_detail        50000 non-null  int64
 3   sort_by                  50000 non-null  int64
 4   image_picker             50000 non-null  int64
 5   account_page_click       50000 non-null  int64
 6   promo_banner_click       50000 non-null  int64
 7   detail_wishlist_add      50000 non-null  int64
 8   list_size_dropdown       50000 non-null  int64
 9   closed_minibasket_click  50000 non-null  int64
 10  checked_delivery_detail  50000 non-null  int64
 11  checked_returns_detail   50000 non-null  int64
 12  sign_in                  50000 non-null  int64
 13  saw_checkout             50000 non-null  int64
 14  saw_sizecharts           50000 non-null  int64
 15  sa

In [136]:
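# Import the function used to split the data
from sklearn.model_selection import train_test_split

# Split the data so that some is used for training and some for testing (usually 80%/20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Display info on the training dataset
X_train.info()
X_train.head(10)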

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 47441 to 10446
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   basket_icon_click        40000 non-null  int64
 1   basket_add_list          40000 non-null  int64
 2   basket_add_detail        40000 non-null  int64
 3   sort_by                  40000 non-null  int64
 4   image_picker             40000 non-null  int64
 5   account_page_click       40000 non-null  int64
 6   promo_banner_click       40000 non-null  int64
 7   detail_wishlist_add      40000 non-null  int64
 8   list_size_dropdown       40000 non-null  int64
 9   closed_minibasket_click  40000 non-null  int64
 10  checked_delivery_detail  40000 non-null  int64
 11  checked_returns_detail   40000 non-null  int64
 12  sign_in                  40000 non-null  int64
 13  saw_checkout             40000 non-null  int64
 14  saw_sizecharts           40000 non-null  int64
 15

Unnamed: 0,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,closed_minibasket_click,...,saw_checkout,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,location
47441,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,1
29626,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
49798,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
39694,1,1,1,0,0,0,0,0,1,0,...,0,0,0,0,1,1,0,0,1,1
35243,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,1,1
21920,0,1,0,0,0,0,0,0,1,0,...,1,0,0,0,1,0,1,0,0,1
26354,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
32513,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
6831,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,1,1
4075,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
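Because the split is random, each run picks different rows, so your numbers may differ slightly from the ones shown here. If you want the same split every time, scikit-learn's `train_test_split` accepts a `random_state` value, and its `stratify` option keeps the share of buyers the same in both parts. The sketch below is one possible setup on made-up data; the names `toy_X` and `toy_y` and the value 42 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 50 visitors, 10 of whom ordered
toy_X = pd.DataFrame({"saw_checkout": [1] * 10 + [0] * 40})
toy_y = pd.Series([1] * 10 + [0] * 40, name="ordered")

# random_state fixes the shuffle so every run gives the same split;
# stratify keeps the proportion of buyers equal in the training and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(len(X_tr), "training rows,", len(X_te), "test rows")
print("Buyers in the test part:", y_te.sum())
```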


The next cell is where we do the AI/ML model training. We use a machine learning algorithm called **XGBoost**.


In [137]:
# Import Python libraries that are required for AI/ML training
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# XGBoost model training. XGBClassifier is used as we do a classification (buyer vs non-buyer)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Predictions based on data not used during training
y_pred = model.predict(X_test)

# Evaluation - Display model accuracy
# The accuracy score represents the percentage of correct predictions made by a model out of all predictions.
# A simple formula of model accuracy is: Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9927


An accuracy score is always between 0 (the model is always wrong) and 1 (the model is always correct).
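A high accuracy is easier to judge when compared with a very simple baseline. Buyers are rare in this dataset (about 4.198% of visitors, as computed earlier), so a "model" that always answers "non-buyer" would already be right most of the time. The short calculation below is a sketch of that comparison; the variable names are chosen for illustration.

```python
# Percentage of buyers reported earlier in this notebook
buyer_percentage = 4.198

# Always predicting "non-buyer" is correct for every visitor who did not buy
baseline_accuracy = (100 - buyer_percentage) / 100
print(f"Accuracy of always predicting 'non-buyer': about {baseline_accuracy:.3f}")
```

A useful model should score clearly above this baseline.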

A confusion matrix is a table used to visualize the performance of a classification model in machine learning (models that predict categories).
<br/> Here we use the confusion matrix to better understand the model's predictions.

In [138]:
# Using confusion matrix to better understand model predictions
cm = confusion_matrix(y_test, y_pred)
# print(cm) ## Uncomment to display the confusion matrix

total_visitors = len(y_test)
# Extract metrics from the confusion matrix
total_buyers = cm[1,0] + cm[1,1]
total_non_buyers = cm[0,0] + cm[0,1]

print(f"There are {total_visitors} visitors in the test dataset. {total_buyers} of them became buyers while {total_non_buyers} others did not.")
print(f"Out of the {total_buyers} real buyers, the model correctly identified {cm[1,1]} as buyers and wrongly predicted {cm[1,0]} as non-buyers.")
print(f"Out of the {total_non_buyers} non-buyers, the model correctly identified {cm[0,0]} as non-buyers and wrongly predicted {cm[0,1]} as buyers.")

model_accuracy = ((cm[0,0]+cm[1,1])/ total_visitors) * 100
print(f"The model accuracy (percentage of correct predictions) is: {model_accuracy}%")

There are 10000 visitors in the test dataset. 414 of them became buyers while 9586 others did not.
Out of the 414 real buyers, the model correctly identified 403 as buyers and wrongly predicted 11 as non-buyers.
Out of the 9586 non-buyers, the model correctly identified 9524 as non-buyers and wrongly predicted 62 as buyers.
The model accuracy (percentage of correct predictions) is: 99.27%


These results show that the model's predictions are not always correct: in the run shown above, it missed 11 real buyers and wrongly flagged 62 non-buyers as buyers.
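Two common ways to summarize these counts are *recall* (the share of real buyers the model found) and *precision* (the share of predicted buyers who really bought). The sketch below computes both from the counts printed in the output above; the variable names are chosen for illustration, and your own counts may differ from run to run.

```python
# Counts copied from the output above
buyers_found = 403    # real buyers predicted as buyers
buyers_missed = 11    # real buyers predicted as non-buyers
false_alarms = 62     # non-buyers predicted as buyers

# Recall: of all real buyers, how many did the model find?
recall = buyers_found / (buyers_found + buyers_missed)

# Precision: of all predicted buyers, how many really bought?
precision = buyers_found / (buyers_found + false_alarms)

print(f"Recall: {recall:.3f}")
print(f"Precision: {precision:.3f}")
```

In this run, recall is higher than precision: the model finds almost all buyers, but some of the visitors it flags do not end up buying.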

# 6: **Use your model for predictions on new online visitors**

In [139]:
# Get random rows from a dataset
df_sample = X_test.sample(n=50)
df_sample

Unnamed: 0,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,closed_minibasket_click,...,saw_checkout,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,location
33236,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,1
40311,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
31062,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,1,0,0,1,1
23277,1,0,1,0,0,0,0,0,1,0,...,1,0,0,0,0,1,0,0,1,1
34249,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,1,0,0,1
14708,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,1,1,0,1,1
25053,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
8364,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
18709,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
12749,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,1


In [140]:
# Display info on the test sample dataset
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 33236 to 7069
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   basket_icon_click        50 non-null     int64
 1   basket_add_list          50 non-null     int64
 2   basket_add_detail        50 non-null     int64
 3   sort_by                  50 non-null     int64
 4   image_picker             50 non-null     int64
 5   account_page_click       50 non-null     int64
 6   promo_banner_click       50 non-null     int64
 7   detail_wishlist_add      50 non-null     int64
 8   list_size_dropdown       50 non-null     int64
 9   closed_minibasket_click  50 non-null     int64
 10  checked_delivery_detail  50 non-null     int64
 11  checked_returns_detail   50 non-null     int64
 12  sign_in                  50 non-null     int64
 13  saw_checkout             50 non-null     int64
 14  saw_sizecharts           50 non-null     int64
 15  sa

In [141]:
predictions = model.predict(df_sample)

df_visitor_prediction = df_sample.copy()
df_visitor_prediction['predicted'] = predictions

In [142]:

# Display the visitors that the AI model predicts will become buyers
df_visitor_prediction[df_visitor_prediction['predicted']==1]

Unnamed: 0,basket_icon_click,basket_add_list,basket_add_detail,sort_by,image_picker,account_page_click,promo_banner_click,detail_wishlist_add,list_size_dropdown,closed_minibasket_click,...,saw_sizecharts,saw_delivery,saw_account_upgrade,saw_homepage,device_mobile,device_computer,device_tablet,returning_user,location,predicted
4834,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,1,0,0,0,1,1


**CONGRATULATIONS!!** You have created and tested a predictive AI model.
<br/> The AI model can predict whether a visitor will become a buyer.
<br/>
Now you can see how our online activity can be used to better understand us. Even with simple data, AI models can help predict consumer behavior.
