<a href="https://colab.research.google.com/github/anumit-web/ML-Analytics-Portfolio-2024/blob/main/8.%20Market%20Basket%20Analysis/Market_Basket_Analysis_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Artificial Intelligence
# Machine Learning
# Portfolio Project
# #8
# Market Basket Analysis
# Goal = Conduct market basket analysis to identify product associations and customer buying patterns.

# unsupervised learning

# find what item a customer is most likely to buy based on information about customer purchase history data

# Apriori algorithm
## a machine learning algorithm used to find frequent itemsets and association rules in data

# Identify frequent items
## identifying the most frequent individual items in the data

# Association rules
## a rule might say that if items A and B are in a transaction, then item C is likely to be included as well

## The Apriori algorithm uses the insight that adding items to an itemset can never make it more frequent, so every superset of an infrequent itemset is also infrequent and can be skipped.

## The algorithm requires two important parameters: minimum support and minimum confidence
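
The level-wise pruning idea can be sketched in plain Python. This is a minimal illustration and not the mlxtend implementation used later in this notebook; the toy transactions and the helper name `frequent_itemsets` are assumptions made for the example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=2):
    """Level-wise search: only itemsets whose subsets are frequent get counted."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = sorted({item for basket in baskets for item in basket})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    result = dict(current)

    size = 2
    while current and size <= max_len:
        # Candidates are built only from items that survived the previous level
        survivors = sorted({item for itemset in current for item in itemset})
        next_level = {}
        for combo in combinations(survivors, size):
            # Prune: every (size - 1)-subset must itself be frequent
            if all(frozenset(sub) in current for sub in combinations(combo, size - 1)):
                s = support(frozenset(combo))
                if s >= min_support:
                    next_level[frozenset(combo)] = s
        result.update(next_level)
        current = next_level
        size += 1
    return result

# Hypothetical toy data, not taken from the grocery dataset
toy_transactions = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["milk", "soda"],
    ["bread", "butter"],
]
found = frequent_itemsets(toy_transactions, min_support=0.5)
for itemset, s in found.items():
    print(sorted(itemset), s)
```

Here `soda` falls below the minimum support at level 1, so no pair containing `soda` is ever counted.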

# support
## Support is the fraction of transactions in the database that contain the chosen item or itemset

# Confidence
## Confidence is the fraction of transactions containing A that also contain B, that is, support(A and B) / support(A).

# Lift
## Lift is the confidence of the rule A -> B divided by the support of B; a value above 1 means B is bought together with A more often than expected if the two were independent
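
To make the three metrics concrete, here is a small sketch that computes them by hand. The toy transactions and the helper functions `support`, `confidence`, and `lift` are assumptions made for this illustration; the notebook itself relies on mlxtend for these values.

```python
# Hypothetical toy baskets, not taken from the grocery dataset
toy_transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "soda"},
    {"bread", "butter"},
    {"soda"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A and B) / support(A)
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    # confidence(A -> B) / support(B)
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

sup_ab = support({"bread", "butter"}, toy_transactions)
conf = confidence({"butter"}, {"bread"}, toy_transactions)
lift_val = lift({"butter"}, {"bread"}, toy_transactions)
print(sup_ab, conf, lift_val)
```

In this toy data every basket with butter also contains bread, so the confidence of butter -> bread is 1, and the lift is above 1 because bread appears in only 3 of the 5 baskets.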

# Market Basket Analysis
## customers who buy a certain item (or group of items) are more likely to buy another specific item (or group of items)

## The relations hence can be used to increase profitability through cross-selling, recommendations, promotions, or even the placement of items on a menu or in a store.


# https://en.wikipedia.org/wiki/Apriori_algorithm

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


# Branding statement

In [None]:
# https://i.ibb.co/zZswY34/Pink-hands-network-2.png

In [None]:
from IPython.display import Image
Image('https://i.ibb.co/zZswY34/Pink-hands-network-2.png')

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# EDA, Exploratory Data Analysis

## import libraries

In [None]:
print('Hello, Market Basket Analysis')

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
from mpl_toolkits.mplot3d import Axes3D
import networkx as nx

## Input data

In [None]:
#

input_data_df = pd.read_csv("https://zenodo.org/records/13898853/files/Groceries_dataset.csv",
                            sep = ',')

In [None]:
number_of_rows = input_data_df.shape[0]
number_of_columns = input_data_df.shape[1]
print("Number of rows = ", number_of_rows)
print("Number of columns = ", number_of_columns)

In [None]:

input_data_df.head()

In [None]:
input_data_df.tail()

In [None]:
input_data_df.info()

In [None]:
input_data_df.describe()


print column names

In [None]:
print(input_data_df.columns)

In [None]:
# Get all Column Header Labels as List
for column_headers in  input_data_df.columns:
    print(column_headers)

## find list of all columns which have null values
using SKIMPY

In [None]:
# Get the count of null values in each column
null_counts = input_data_df.isnull().sum()
print('Count of null values per column:')
print(null_counts)

In [None]:
try:
  import skimpy
except ImportError:
  # Install skimpy only when it is missing
  !pip install skimpy
  import skimpy

In [None]:
# ! pip install skimpy

In [None]:
from skimpy import skim

In [None]:
skim(input_data_df)

## find unique values by column names

In [None]:
# Find the number of unique values in each column
unique_values = input_data_df.nunique()
print(unique_values)

In [None]:
try:
  import summarytools
except ImportError:
  # Install summarytools only when it is missing
  !pip install summarytools
  import summarytools

In [None]:
# ! pip install summarytools

In [None]:
from summarytools import dfSummary

In [None]:
dfSummary(input_data_df)

# Data Processing

## convert date column to date data type



In [None]:
# print
input_data_df.head()

In [None]:
input_data_df["Date"]=pd.to_datetime(input_data_df["Date"])

# input_data_df["Date"] = pd.to_datetime(input_data_df["Date"], format='%d.%m.%Y %H:%M')

In [None]:
input_data_df.head()

In [None]:
input_data_df.dtypes

In [None]:
input_data_df['itemDescription'] = input_data_df['itemDescription'].apply(str)

# Change the 'Age' column type to float
# input_data_df['Price'] = pd.to_numeric(input_data_df['Price'], downcast='float')

In [None]:
input_data_df.dtypes

In [None]:
import missingno as msno

# Visualize missing values as a matrix
msno.matrix(input_data_df)

In [None]:
# exit(0)

change index to date type and date column

In [None]:
input_data_df.head()

do NOT drop date column because we need it for model training for linear and random forest algorithms

# Data Visualization 📊📈📉

## Total Sales Analysis

We group the data by month and year, calculating the total sum of sales to understand the sales trend over time. The resulting visualization depicts the total sales per month
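
A minimal sketch of that monthly grouping is shown here. Since this dataset has no price column, it counts purchased item rows per month as a proxy for sales; the toy frame `toy_df` and the name `monthly_counts` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for input_data_df, using the same column names
toy_df = pd.DataFrame({
    "Member_number": [1808, 2552, 2300],
    "Date": pd.to_datetime(["2015-07-21", "2015-01-05", "2015-01-19"]),
    "itemDescription": ["tropical fruit", "whole milk", "pip fruit"],
})

# Count purchased items per calendar month (year and month together)
monthly_counts = toy_df.groupby(toy_df["Date"].dt.to_period("M")).size()
print(monthly_counts)
```

On the full data the same grouping could be applied to `input_data_df` once `Date` has been converted to a datetime type, and the result drawn with `monthly_counts.plot(kind='bar')`.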

## Top 20 products sold


In [None]:
product_counts = input_data_df["itemDescription"].value_counts()

In [None]:
product_counts.head(20)

In [None]:
product_counts.head(20).plot(kind='bar', figsize=(10, 6), color="purple")
plt.title('Top 20 Products by Purchase Count')
plt.xlabel('Product (itemDescription)')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# exit(0)

# Data Calculations

## convert the dataframe into another dataframe where each row is a transaction

In [None]:
processed_data_df = input_data_df.copy()

In [None]:
processed_data_df['itemDescription'] = processed_data_df['itemDescription'].transform(lambda x: [x])

In [None]:
processed_data_df.head()

In [None]:
# One row per (member, date) basket: summing the single-item lists concatenates them
processed_data_df = processed_data_df.groupby(['Member_number', 'Date'])['itemDescription'].sum().reset_index(drop=True)

In [None]:
processed_data_df.head()

In [None]:
processed_data_df.tail()

In [None]:
encoder = TransactionEncoder()
transactions = pd.DataFrame(encoder.fit(processed_data_df).transform(processed_data_df), columns=encoder.columns_)

transactions.head()
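
For intuition, the one-hot layout built by the encoder can be approximated in plain Python. This is a sketch of the idea, not mlxtend's code; `one_hot_encode` and `toy_baskets` are hypothetical names introduced for the example.

```python
def one_hot_encode(baskets):
    """Return (sorted item columns, one boolean row per basket)."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = [[item in basket for item in columns] for basket in baskets]
    return columns, rows

# Hypothetical toy baskets
toy_baskets = [["soda", "herbs"], ["whole milk"], ["herbs", "whole milk"]]
columns, rows = one_hot_encode(toy_baskets)

print(columns)
for row in rows:
    print(row)
```

Each column is one product and each row one transaction, with `True` where the product is in the basket; this boolean table is the shape of input that `apriori` works on.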

In [None]:
frequent_itemsets_df = apriori(transactions, min_support= 6/len(processed_data_df), use_colnames=True, max_len = 2)


frequent_itemsets_df.head()

In [95]:
rules_df = association_rules(frequent_itemsets_df, metric="lift",  min_threshold = 1.5)

In [96]:
rules_df.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(butter milk),(UHT-milk),0.017577,0.021386,0.000601,0.034221,1.600131,0.000226,1.013289,0.381761
1,(UHT-milk),(butter milk),0.021386,0.017577,0.000601,0.028125,1.600131,0.000226,1.010854,0.383247
2,(cream cheese ),(UHT-milk),0.023658,0.021386,0.000869,0.036723,1.717152,0.000363,1.015922,0.427761
3,(UHT-milk),(cream cheese ),0.021386,0.023658,0.000869,0.040625,1.717152,0.000363,1.017685,0.426767
4,(soda),(artif. sweetener),0.097106,0.001938,0.000468,0.004818,2.485725,0.00028,1.002893,0.661986


In [97]:
processed_data_df.tail()

Unnamed: 0,itemDescription
14958,"[butter milk, whipped/sour cream]"
14959,"[bottled water, herbs]"
14960,"[fruit/vegetable juice, onions]"
14961,"[bottled beer, other vegetables]"
14962,"[soda, root vegetables, semi-finished bread]"


## Business Application

### whole milk

find items which drive sales of whole milk

In [98]:
milk_rules_df = rules_df[rules_df['consequents'].astype(str).str.contains('whole milk')]
milk_rules_df = milk_rules_df.sort_values(by=['lift'],ascending = [False]).reset_index(drop = True)

milk_rules_df.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(brandy),(whole milk),0.00254,0.157923,0.000869,0.342105,2.166281,0.000468,1.279957,0.53975
1,(softener),(whole milk),0.00274,0.157923,0.000802,0.292683,1.853328,0.000369,1.190523,0.461695
2,(canned fruit),(whole milk),0.001403,0.157923,0.000401,0.285714,1.809201,0.000179,1.178908,0.447899
3,(syrup),(whole milk),0.001403,0.157923,0.000401,0.285714,1.809201,0.000179,1.178908,0.447899
4,(artif. sweetener),(whole milk),0.001938,0.157923,0.000535,0.275862,1.746815,0.000229,1.162868,0.42836


In [99]:
exit(1)

# Data Calculations

## strategy
1. Select one of the columns to train and test on: high, low, volume
2. Split the input data into train and test sets
3. Run your models
    a. Linear regression
    b. Random Forest
4. Stop Words
5. Stemming
6. Lemmatization




Tokenization

## Linear Regression

date vs volume

predict values of volume

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables

independent variable = date
dependent variable = volume

In [100]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

In [101]:
input_data_df.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,2015-07-21,tropical fruit
1,2552,2015-01-05,whole milk
2,2300,2015-09-19,pip fruit
3,1187,2015-12-12,other vegetables
4,3037,2015-02-01,whole milk


In [102]:
x = input_data_df.copy()
y = input_data_df['Close']

KeyError: 'Close'

In [None]:
type(x)

In [None]:
type(y)

In [None]:
# x2 = x.to_frame()
x2 = x

# print
x2

In [None]:
# y2 = y.to_frame()
y2 = y

#print
y2

In [None]:
# x3 = x2.reset_index(drop=True)
x3 = x2
# drop column 'B'
x3 = x3.drop('Date', axis=1)

# print
x3

In [None]:
# y3 =  y2.reset_index(drop=True)
y3 = y2

# print
y3

In [None]:
print('Data type = ( x3 = )',type(x3),  'y3 =', type(y3))

In [None]:
# Train the model
model.fit(x3, y3)
# model.fit(x3.values.reshape(-1, 1), y3.values.reshape(-1, 1))


# Evaluate the model
# r2_score = model.score(x3.values.astype(float).reshape(-1, 1), y3.values.reshape(-1, 1))
r2_score = model.score(x3, y3)
print(f"R-squared value: {r2_score}")

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

model_rand = RandomForestRegressor(n_estimators=200, random_state = 42)

In [None]:
model_rand.fit(x, y)

In [None]:
# Score the random forest on the same data it was fitted on
r2_score_rf = model_rand.score(x, y)
print(f"R-squared value: {r2_score_rf}")

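The R-squared value measures how well a regression model fits the data. It is at most 1, where 1 indicates a perfect fit; scikit-learn's `score` can also return negative values when the model fits worse than always predicting the mean.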

In [None]:
text_location = input_data_df.columns.get_loc('text')

# Create an empty 'text_processed' column
input_data_df['text_processed'] = ""

# Move it so it sits right after the 'text' column
input_data_df.insert(text_location + 1, 'text_processed', input_data_df.pop('text_processed'))

text_processed_location = input_data_df.columns.get_loc('text_processed')

# Create an empty 'text_processed_2' column
input_data_df['text_processed_2'] = ""

# Move it so it sits right after the 'text_processed' column
input_data_df.insert(text_processed_location + 1, 'text_processed_2', input_data_df.pop('text_processed_2'))

text_processed_location_2 = input_data_df.columns.get_loc('text_processed_2')

# Create an empty 'text_processed_3' column
input_data_df['text_processed_3'] = ""

# Move it so it sits right after the 'text_processed_2' column
input_data_df.insert(text_processed_location_2 + 1, 'text_processed_3', input_data_df.pop('text_processed_3'))


# print

input_data_df.head()


## 1. Lower case string

In [None]:
input_data_df['text_processed'] = input_data_df['text'].str.lower()

input_data_df.head()

## 2. tokenize sentences

In [None]:
import nltk
nltk.download('all')

In [None]:
from nltk.tokenize import word_tokenize

# Tokenize the lowercased 'text_processed' column in place (each row becomes a list of tokens)
input_data_df['text_processed'] = input_data_df['text_processed'].apply(word_tokenize)

input_data_df.head()

## punctuation removal

In [None]:

input_data_df['text_processed_2'] = input_data_df['text_processed'].copy()

input_data_df.head()



In [None]:
input_data_df.loc[:, 'text_processed_3'] = input_data_df.loc[:, 'text_processed_2']

input_data_df.head()

In [None]:
import string

# input_data_df['text_processed_2'] = input_data_df['text_processed'].copy()

def remove_punctuation(tokens):

  tokens2 = []
  for word in tokens:
    #if (word in string.punctuation):
      #tokens.remove(word)
    if(word not in string.punctuation):
      tokens2.append(word)

  #return tokens
  return tokens2
  # return ""

# input_data_df['text_processed_2'] = input_data_df['text_processed']
input_data_df['text_processed_3'] = input_data_df['text_processed_3'].apply(remove_punctuation)

input_data_df.head(5)

# input_data_df['text', 'text_processed'].head()

## stop words removal

In [None]:
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [None]:

# Build the stop-word set once; calling stopwords.words() for every token is slow
english_stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
  # Keep only the tokens that are not English stop words
  return [word for word in tokens if word not in english_stop_words]

input_data_df['text_processed_3'] = input_data_df['text_processed_3'].apply(remove_stop_words)

input_data_df.head()

## 5. stemming

In [None]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [None]:
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english") # Snowball is an updated version of the Porter stemmer, so it is the one used below

def stemwords_in_sentence(tokens):
  stemmed_words = []
  for word in tokens:
      stemmed_word = snowball_stemmer.stem(word)
      stemmed_words.append(stemmed_word)


  return stemmed_words

input_data_df['text_processed_3'] = input_data_df['text_processed_3'].apply(stemwords_in_sentence)

input_data_df.head()

## 6. lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()

def lemmatize_in_sentence(tokens):
  lemmatized_words = []
  for word in tokens:
      lemmatized_word =   lemmatizer.lemmatize(word)
      lemmatized_words.append(lemmatized_word)


  return lemmatized_words

input_data_df['text_processed_3'] = input_data_df['text_processed_3'].apply(lemmatize_in_sentence)

input_data_df.head()


## 7. Join tokens into string

In [None]:
# input_data_df['text_processed_3'] =

In [None]:
text_processed_location_3 = input_data_df.columns.get_loc('text_processed_3')

# Create an empty 'text_processed_4' column
input_data_df['text_processed_4'] = ""

# Move it so it sits right after the 'text_processed_3' column
input_data_df.insert(text_processed_location_3 + 1, 'text_processed_4', input_data_df.pop('text_processed_4'))



In [None]:
input_data_df['text_processed_4'] = input_data_df['text_processed_3'].apply(lambda token: ' '.join(token))

In [None]:
input_data_df.head()

## Change column names

## fill mean average values in rows and columns

## Process combined data from control group and test group

# Data Calculations

## Find sentiment of tweets and posts

## VADER (Valence Aware Dictionary and sEntiment Reasoner)

negative, neutral, and positive scores

compound score can range from -1 to 1.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

In [None]:
sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores("Wow, NLTK is really powerful!")

In [None]:
sia.polarity_scores("I had a bad experience at the grocery store!")

In [None]:
sia.polarity_scores("I am going to office")

In [None]:
# create get_sentiment function

def get_sentiment(text):
    # Map VADER's compound score to 1 (positive), 0 (neutral), or -1 (negative)
    scores = sia.polarity_scores(text)

    if scores['compound'] > 0:
        new_sentiment = 1
    elif scores['compound'] < 0:
        new_sentiment = -1
    else:
        new_sentiment = 0

    return new_sentiment


# apply get_sentiment function

input_data_df['calculated_sentiment'] = input_data_df['text_processed_4'].apply(get_sentiment)

input_data_df.head()

In [None]:
# Group by 'calculated_sentiment' and count the number of rows in each group
result =  input_data_df.groupby('calculated_sentiment').size()

print(result)

In [None]:
sentiment_location = input_data_df.columns.get_loc('sentiment')

# Create a placeholder 'sentiment_to_number' column (9999999 marks "not yet converted")
input_data_df['sentiment_to_number'] = 9999999

# Move it so it sits right after the 'sentiment' column
input_data_df.insert(sentiment_location + 1, 'sentiment_to_number', input_data_df.pop('sentiment_to_number'))

input_data_df.head()

In [None]:
def convert_sentiment_to_number(sentiment):

  sentiment_number = 66666666

  if(sentiment == 'positive'):
    sentiment_number = 1
  if(sentiment == 'neutral'):
    sentiment_number = 0
  if(sentiment == 'negative'):
     sentiment_number = -1

  return sentiment_number
  # return ""

# input_data_df['text_processed_2'] = input_data_df['text_processed']
input_data_df['sentiment_to_number'] = input_data_df['sentiment'].apply(convert_sentiment_to_number)

input_data_df.head(5)

## Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cf_matrix = confusion_matrix(input_data_df['sentiment_to_number'],
                             input_data_df['calculated_sentiment'])
print(cf_matrix)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(input_data_df['sentiment_to_number'], input_data_df['calculated_sentiment']))

In [None]:
import seaborn as sns
sns.heatmap(cf_matrix, annot=True)

In [None]:
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
            fmt='.2%', cmap='Blues')

In [None]:
# Note: these four labels and reshape(2,2) assume a 2x2 (binary) confusion matrix
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

Conversion Rate

Conversion Rate = (Number of Conversions / Number of Visitors) x 100

add new columns
normalized = value between 0 and 1
percent = value between 0 and 100

In [None]:
combined_data_df['CTR_Normalized'] = 0
combined_data_df['CTR_Percent'] = 0
combined_data_df['CR_Normalized'] = 0
combined_data_df['CR_Percent'] = 0

#print
combined_data_df.head()

In [None]:
combined_data_df['CTR_Normalized'] = (combined_data_df['Clicks'] /
                      combined_data_df['Impressions'])

combined_data_df['CTR_Percent'] = (combined_data_df['Clicks'] /
                      combined_data_df['Impressions']) * 100

# print
combined_data_df.head()

In [None]:
combined_data_df['CR_Normalized'] = (combined_data_df['Purchases'] /
                      combined_data_df['Clicks'])

combined_data_df['CR_Percent'] = (combined_data_df['Purchases'] /
                      combined_data_df['Clicks']) * 100


#print
combined_data_df.head()

In [None]:
#CTR

temp_df1 = combined_data_df.groupby('Campaign Name')['CTR_Normalized'].mean()


In [None]:
temp_df2 = combined_data_df.groupby('Campaign Name')['CR_Normalized'].mean()

In [None]:
temp_df3 = combined_data_df.groupby('Campaign Name')['CTR_Percent'].mean()

In [None]:
temp_df4 = combined_data_df.groupby('Campaign Name')['CR_Percent'].mean()

In [None]:
temp_df5 = pd.merge(temp_df1, temp_df2, on='Campaign Name')

#print
temp_df5

In [None]:
temp_df6 = pd.merge(temp_df3, temp_df4, on='Campaign Name')

In [None]:
temp_df7 = pd.merge(temp_df5, temp_df6, on='Campaign Name')

# print
temp_df7

In [None]:
calculations_df = temp_df7

# print
temp_df7

# Visualization

## CTR , Click through rate

### Bar chart of CTR

In [None]:
colors = sns.color_palette(['#06C', '#F4B678'])

sns.barplot(data=calculations_df, x='Campaign Name',
                    y='CTR_Percent', hue='Campaign Name', palette = colors, dodge=False)
plt.title('Average Metrics of CTR')
plt.show()

### Box chart of CTR

In [None]:
colors = sns.color_palette(['#73C5C5', '#A30000'])

sns.boxplot(x='Campaign Name', y='CTR_Normalized', data=combined_data_df,
            hue='Campaign Name', dodge=False, palette = colors)
plt.title('CTR Distribution by Campaign')
plt.show()

### Violin chart of CTR

In [None]:
colors = sns.color_palette(['#7CC674', '#F0AB00'])

sns.violinplot(x='Campaign Name', y='CTR_Normalized', data=combined_data_df,
            hue='Campaign Name', dodge=False, palette = colors)
plt.title('CTR Distribution by Campaign')
plt.show()

## CR, Conversion rate

### Bar chart of CR

In [None]:
colors = sns.color_palette(['#06C', '#F4B678'])

sns.barplot(data=calculations_df, x='Campaign Name',
                    y='CR_Percent', hue='Campaign Name', palette = colors, dodge=False)
plt.title('Average Metrics of CR')
plt.show()

### Box chart of CR

In [None]:
colors = sns.color_palette(['#73C5C5', '#A30000'])

sns.boxplot(x='Campaign Name', y='CR_Normalized', data=combined_data_df,
            hue='Campaign Name', palette = colors, dodge=False)
plt.title('CR Distribution by Campaign')
plt.show()

### Violin chart of CR

In [None]:
colors = sns.color_palette(['#7CC674', '#F0AB00'])

sns.violinplot(x='Campaign Name', y='CR_Normalized', data=combined_data_df,
            hue='Campaign Name', palette = colors, dodge=False)
plt.title('CR Distribution by Campaign')
plt.show()

# Analysis and Conclusions

**CTR (Click-Through Rate):**

Visual Observation: The histogram for CTR shows a higher mean for the Test Campaign compared to the Control Campaign.
The KDE line for the Test Campaign is consistently above that of the Control Campaign.
There are more outliers on the right for the Test Campaign than for the Control Campaign.
Conclusion: The Test Campaign has a higher CTR, indicating better engagement.



**CR (Conversion Rate):**

Visual Observation: The histogram for CR shows similar means for both Control and Test Campaigns, with the Test Campaign
having a slightly lower mean. The KDE lines overlap considerably, indicating similar distributions.
Conclusion: There is no significant difference in CR between the Control and Test Campaigns.

# The End 🛑