# **Homework-3A**
Recommend based on likelihood of category purchase

**Step 1: Transform Dataset**

The clickstream dataset must be preprocessed into a format suitable for training a logistic regression model for category recommendation. This involves encoding categorical variables, handling missing values, and splitting the data into features (independent variables) and the target variable (the category to be recommended). The steps are:

1. **Load the Dataset**: Load the clickstream dataset into a DataFrame.

2. **Filter Relevant Data**: Keep only the columns that describe user behavior and the category involved.

3. **Create Target Variable**: Create a binary target variable indicating whether a category (here category 3, blouses) was purchased.

4. **Feature Engineering**: Extract features that might help predict category purchase, such as session-related and page-related features.

5. **One-Hot Encoding**: Convert categorical variables into binary indicators using one-hot encoding.

6. **Split the Dataset**: Split the dataset into training and testing sets to evaluate model performance.



In [32]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
data = pd.read_csv("clickstream.csv", sep=';')

# Filter relevant columns
relevant_columns = ['session ID', 'page 1 (main category)', 'page 2 (clothing model)', 'page']
data = data[relevant_columns]

# Create target variable indicating category 3 (blouses) purchase
data['target'] = (data['page 1 (main category)'] == 3).astype(int)

# Feature engineering: one-hot encode the 'page' column
# (sparse_output replaces the deprecated sparse argument in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_pages = encoder.fit_transform(data[['page']])
encoded_pages_df = pd.DataFrame(encoded_pages, columns=[f'page_{category}' for category in encoder.categories_[0]], index=data.index)

# Concatenate the encoded features with the original dataset
data = pd.concat([data, encoded_pages_df], axis=1)

# Drop the original categorical columns
data.drop(columns=['page', 'page 1 (main category)', 'page 2 (clothing model)'], inplace=True)

# Split the dataset into training and testing sets
# ('session ID' stays in X as a numeric feature alongside the page indicators)
X = data.drop(columns=['target'])
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the logistic regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Evaluate the model
train_accuracy = logistic_model.score(X_train, y_train)
test_accuracy = logistic_model.score(X_test, y_test)
print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)





Train Accuracy: 0.7664961965266395
Test Accuracy: 0.7683638011784257




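Train and test accuracy are both about 77% (0.766 and 0.768). The near-identical scores suggest the model is not overfitting, but accuracy alone does not show that the model is good: the figure should be compared with the accuracy of always predicting the more common class.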
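One way to put the accuracy figure in context is a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent label. The sketch below uses a small made-up label list rather than the clickstream data, and `majority_baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels (illustrative only): 1 = blouse row, 0 = any other category
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

On the real data, `majority_baseline_accuracy(y_test)` would give the number to compare with the test accuracy. A model adds value only to the extent that it beats this baseline.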

**Step 2: Provide rationale for selected vs dropped features**

Selecting features for a logistic regression model involves considering which features are relevant for predicting the target variable (in this case, the likelihood of purchasing blouses). Here are some considerations for selecting features and potentially dropping others:

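1. **Relevance to the Target Variable**: 'page 1 (main category)' and 'page 2 (clothing model)' relate most directly to the product category and the specific item being viewed. However, 'page 1 (main category)' is used to define the target, so keeping it as a feature would leak the answer into the model; it is dropped once the target has been created. 'page 2 (clothing model)' identifies a specific product, which is closely tied to its category, so it is dropped for the same reason. The model is therefore trained on 'session ID' and the one-hot encoded 'page' column.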

2. **Predictive Power**: Features with a strong correlation or association with the target variable should be included, because they give the model the most information for accurate predictions.

3. **Data Quality**: Features with missing or inconsistent data may not provide reliable information for prediction and may be dropped to avoid introducing noise into the model.

4. **Redundancy**: If certain features contain redundant information or are highly correlated with each other, it may be beneficial to drop one of them to simplify the model and avoid multicollinearity issues.

5. **Domain Knowledge**: Domain-specific knowledge should guide feature selection. For example, if certain features are known to have a significant impact on purchasing behavior based on previous studies or industry expertise, they should be included in the model.

6. **Model Complexity**: Including too many features can lead to overfitting, especially if the dataset is limited in size. It's important to strike a balance between including sufficient features to capture the complexity of the problem and avoiding unnecessary complexity that could degrade the model's performance on unseen data.

7. **Performance and Interpretability**: Model performance should be evaluated with different sets of features, choosing the combination that best balances predictive accuracy and interpretability.

Overall, the rationale for selecting features is based on their relevance, predictive power, data quality, domain knowledge, and considerations of model complexity and performance. Features that do not contribute meaningfully to the prediction task or introduce noise should be dropped from the model.
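The redundancy check in point 4 could be sketched as follows: compute pairwise Pearson correlations and flag pairs above a cutoff. The toy values and the 0.8 threshold are illustrative assumptions, and `pearson` and `redundant_pairs` are hypothetical helper names; only the column names are borrowed from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def redundant_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for name_a, name_b in combinations(features, 2):
        r = pearson(features[name_a], features[name_b])
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Toy values (illustrative only): 'price 2' flags whether the price is high
toy_features = {
    "price": [28, 33, 38, 43, 48, 57],
    "price 2": [0, 0, 0, 1, 1, 1],
    "page": [1, 3, 2, 1, 4, 2],
}
flagged = redundant_pairs(toy_features, threshold=0.8)
print(flagged)
```

A flagged pair is a candidate for dropping one of the two columns. Which one to drop is a judgment call that depends on interpretability and data quality.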

**Step 3: Discuss limitations of using predictive models for recommending next category to a customer**

Using predictive models for recommending the next category to a customer comes with several limitations:

1. **Data Quality and Quantity**: Predictive models rely heavily on the quality and quantity of data available for training. If the dataset is limited or contains biases, the predictive model may not capture the full complexity of customer behavior, leading to inaccurate recommendations.

2. **Overfitting**: Predictive models may overfit the training data, meaning they capture noise or random fluctuations in the data rather than the underlying patterns. Overfitting can result in poor generalization to new data, leading to suboptimal recommendations.

3. **Cold Start Problem**: Predictive models may struggle with new or infrequently observed items or categories. When there is limited historical data available for these items, the model may not accurately predict customer preferences or behavior, leading to less effective recommendations.

4. **Changing User Preferences**: Customer preferences and behavior can change over time due to various factors such as trends, seasons, or external events. Predictive models may not adapt quickly enough to these changes, resulting in outdated or irrelevant recommendations.

5. **Limited Contextual Understanding**: Predictive models typically analyze historical data to make recommendations but may lack the ability to understand the context or intent behind customer actions. As a result, recommendations may not align with the user's current needs or preferences.

6. **Lack of Interpretability**: Some predictive models, such as complex machine learning algorithms, may lack interpretability, making it challenging to understand how they generate recommendations. This opacity can erode user trust and hinder the ability to identify and address biases or errors in the model.

7. **Ethical and Privacy Concerns**: Predictive models may inadvertently reinforce biases or stereotypes present in the training data, leading to unfair or discriminatory recommendations. Moreover, the collection and analysis of user data for predictive modeling raise privacy concerns, requiring careful consideration of ethical and regulatory implications.

Thus, while predictive models can provide valuable insights and recommendations to customers, they also pose several challenges and limitations that need to be carefully addressed to ensure their effectiveness, fairness, and ethical use.

# **Homework-3B**
Recommend based on association rules

**Step 1: Transform the dataset into a transaction format and mine frequent itemsets**

An association-rules approach starts with two steps:

1. **Transform the Clickstream Dataset into a Transaction Format**: Treat each row of the dataset as a transaction whose items are its attribute values (for example `colour=3`). Preprocess the dataset so that each transaction is a list of such items.

2. **Mine Frequent Itemsets**: Once the dataset is in transaction format, use an algorithm such as Apriori or FP-Growth to mine frequent itemsets, i.e. sets of items that frequently co-occur in transactions. These itemsets are the foundation for discovering association rules.
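For category-level rules, an alternative to one transaction per row is one basket per session holding the distinct categories seen in that session. The following is a minimal sketch of that grouping under stated assumptions: the toy rows are invented, `build_baskets` is a hypothetical helper, and only the column names `session ID` and `page 1 (main category)` come from the dataset.

```python
from collections import defaultdict

# Toy click rows (illustrative only)
toy_clicks = [
    {"session ID": 1, "page 1 (main category)": 1},
    {"session ID": 1, "page 1 (main category)": 3},
    {"session ID": 1, "page 1 (main category)": 3},
    {"session ID": 2, "page 1 (main category)": 2},
    {"session ID": 3, "page 1 (main category)": 1},
    {"session ID": 3, "page 1 (main category)": 4},
]


def build_baskets(clicks):
    """Group clicks by session into sets of 'category=<n>' items."""
    baskets = defaultdict(set)
    for click in clicks:
        item = f"category={click['page 1 (main category)']}"
        baskets[click["session ID"]].add(item)
    return dict(baskets)


baskets = build_baskets(toy_clicks)
for session_id, items in baskets.items():
    print(session_id, sorted(items))
```

The expression `[sorted(items) for items in baskets.values()]` then yields a list of lists, the kind of input `TransactionEncoder` expects.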

In [9]:
data.columns



Index(['session ID', 'page 1 (main category)', 'page 2 (clothing model)',
       'page', 'target', 'page_1', 'page_2', 'page_3', 'page_4', 'page_5'],
      dtype='object')

**Step 2: Develop a set of association rules at category level**

To develop a set of association rules at the category level, you can use the Apriori algorithm. This algorithm helps identify frequent itemsets, which are sets of items that frequently occur together in transactions. From these frequent itemsets, association rules can be generated.

Here's how you can develop association rules at the category level:

1. **Transform Dataset**: Convert your dataset into a transaction format where each transaction represents a set of items (categories in your case).

2. **Mine Frequent Itemsets**: Use the Apriori algorithm to mine frequent itemsets from the transaction data. Frequent itemsets are sets of items that occur together frequently in transactions. You can set a minimum support threshold to filter out infrequent itemsets.

3. **Generate Association Rules**: From the frequent itemsets, generate association rules. Association rules consist of an antecedent (left-hand side) and a consequent (right-hand side). The rules indicate that if the antecedent is present in a transaction, the consequent is likely to be present as well.

4. **Evaluate Rules**: Evaluate the generated association rules based on metrics like support, confidence, and lift. These metrics help assess the significance and strength of the association between the antecedent and consequent.
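To make the three metrics concrete, here is a small worked sketch on invented baskets. Support is the share of baskets containing an itemset, confidence is support(antecedent and consequent) divided by support(antecedent), and lift is confidence divided by support(consequent). The helper names `support` and `rule_metrics` are hypothetical and are not part of mlxtend.

```python
def support(baskets, itemset):
    """Share of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    Assumes both sides occur in at least one basket (no zero division).
    """
    support_both = support(baskets, set(antecedent) | set(consequent))
    confidence = support_both / support(baskets, antecedent)
    lift = confidence / support(baskets, consequent)
    return {"support": support_both, "confidence": confidence, "lift": lift}


# Toy baskets (illustrative only)
toy_baskets = [
    {"category=1", "category=3"},
    {"category=1", "category=3"},
    {"category=1"},
    {"category=2"},
    {"category=2", "category=3"},
]
metrics = rule_metrics(toy_baskets, {"category=1"}, {"category=3"})
print(metrics)
```

A lift above 1 means the antecedent and consequent co-occur more often than they would if they were independent. A lift near 1 means the rule adds little beyond the consequent's overall popularity.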




In [33]:
#Step1 and Step2 combined
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load the dataset
data = pd.read_csv("clickstream.csv", sep=';')


# Transform the dataset into a transaction format:
# each row becomes one transaction whose items are "column=value" strings
transactions = []
for _, row in data.iterrows():
    transaction = [str(row['session ID'])]  # bare session ID, kept as an extra item
    transaction.extend([f"{key}={row[key]}" for key in row.keys()])
    transactions.append(transaction)

# Mine frequent itemsets using Apriori algorithm
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
transactions_df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(transactions_df, min_support=0.05, use_colnames=True)

# Print the frequent itemsets
print(frequent_itemsets)

# Generate association rules
association_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

# Print the association rules
print(association_rules)




       support                                           itemsets
0     0.053362                                        (colour=12)
1     0.096323                                        (colour=14)
2     0.179871                                         (colour=2)
3     0.176819                                         (colour=3)
4     0.099816                                         (colour=4)
...        ...                                                ...
1302  0.085312  (page=1, model photography=1, page 1 (main cat...
1303  0.058910  (page=1, model photography=1, price 2=1, count...
1304  0.052915  (model photography=1, price=48, price 2=1, pag...
1305  0.053126  (page=1, price=43, model photography=1, page 1...
1306  0.053126  (page=1, price=43, model photography=1, page 1...

[1307 rows x 2 columns]
                               antecedents  \
0                              (colour=12)   
1                              (colour=14)   
2                              (colour=14)   

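Because every column is encoded as an item, these association rules relate attribute values such as colour, price, and page as well as categories. The rules whose consequent is a category item are the ones relevant for category-level recommendations; together they help in understanding customer behavior.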

**Step 3: For the rules that end in a recommendation of a blouse (category = 3), evaluate how many completed transactions were missed, where rules would have yielded additional revenue**



To evaluate how many completed transactions were missed where rules would have yielded additional revenue, you can follow these steps:

1. **Filter Rules**: Filter the association rules to include only those that end in a recommendation of a blouse (category = 3).

2. **Identify Missed Transactions**: Determine which completed transactions did not satisfy any of the association rules that recommend a blouse. These transactions represent missed opportunities where additional revenue could have been generated if the recommendation had been made.

3. **Calculate Additional Revenue**: For each missed transaction, estimate the potential additional revenue that could have been generated if the recommendation had been successful.

4. **Summarize Results**: Summarize the number of missed transactions and the potential additional revenue across all missed opportunities.
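The steps above can be sketched on toy data as follows. This follows the definition used in the text (a transaction is missed when it matches none of the blouse-rule antecedents). The baskets, the antecedents, the average blouse price, and the acceptance rate are all invented assumptions, and both function names are hypothetical.

```python
def find_missed_sessions(baskets, blouse_rule_antecedents):
    """Return session IDs whose basket matches none of the antecedents."""
    missed = []
    for session_id, items in baskets.items():
        matched = any(antecedent <= items for antecedent in blouse_rule_antecedents)
        if not matched:
            missed.append(session_id)
    return missed


def estimate_additional_revenue(num_missed, avg_blouse_price, acceptance_rate):
    """Rough estimate: missed sessions x assumed price x assumed uptake."""
    return num_missed * avg_blouse_price * acceptance_rate


# Toy baskets keyed by session ID (illustrative only)
toy_baskets = {
    101: {"category=1", "category=2"},
    102: {"category=2"},
    103: {"category=4"},
    104: {"category=1"},
}
# Antecedents of hypothetical rules whose consequent is the blouse category
toy_antecedents = [
    frozenset({"category=1"}),
    frozenset({"category=2", "category=4"}),
]

missed = find_missed_sessions(toy_baskets, toy_antecedents)
revenue = estimate_additional_revenue(len(missed), avg_blouse_price=40.0, acceptance_rate=0.1)
print("Missed sessions:", missed)
print("Estimated additional revenue:", revenue)
```

The revenue figure is only as reliable as the assumed price and acceptance rate, so in practice both would need to be estimated from data.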

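In [None]:
# Filter association rules to include only those recommending a blouse (category = 3).
# Items are labelled "column=value", so the blouse item is 'page 1 (main category)=3';
# the substring 'category=3' never occurs in these labels.
blouse_item = 'page 1 (main category)=3'
blouse_rules = association_rules[association_rules['consequents'].apply(lambda items: blouse_item in items)]

# Collect the antecedents once instead of re-reading the rules for every row
blouse_antecedents = [set(antecedent) for antecedent in blouse_rules['antecedents']]

# Identify missed transactions where no blouse recommendation was made
missed_transactions = []
for _, row in data.iterrows():
    transaction = {str(row['session ID'])}  # Unique identifier for each transaction
    transaction.update(f"{key}={row[key]}" for key in row.keys())
    if not any(antecedent <= transaction for antecedent in blouse_antecedents):
        missed_transactions.append(row['session ID'])


In [35]:
# Calculate the number of missed transactions
num_missed_transactions = len(missed_transactions)
print("Number of missed transactions:", num_missed_transactions)

Number of missed transactions: 165474

Note that 165474 equals the number of rows in the dataset. This run used the filter `'category=3' in str(x)`, which matches no item label, so no blouse rules were selected and every row was counted as missed. The count should be recomputed with the full item label `'page 1 (main category)=3'` before it is read as a number of missed opportunities.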


**Step 4**:

3A (recommend based on likelihood of category purchase) and 3B (recommend based on association rules) are two different approaches to recommendation, each with its own characteristics, advantages, and limitations.

**Differences:**

1. **Methodology:**
   - 3A relies on predictive models like logistic regression to estimate the likelihood of a customer purchasing a particular category based on various features and signals. It uses historical data to train the model and make predictions.
   - 3B, on the other hand, uses association rule mining techniques such as the Apriori algorithm to discover relationships between items in the dataset. It identifies frequently occurring itemsets and generates rules based on the presence of certain items in transactions.

2. **Input Data:**
   - 3A typically requires labeled data with historical information about customer transactions, including features and the corresponding category purchased.
   - 3B works with transactional data where each record represents a transaction containing items purchased by a customer.

3. **Output:**
   - 3A provides recommendations based on the calculated probabilities of category purchases. It predicts the likelihood of a customer purchasing a specific category and recommends items accordingly.
   - 3B generates recommendations based on discovered association rules. When certain items are purchased, it suggests other items that tend to co-occur in transactions.

**Similarities:**

1. **Personalization:**
   - Both approaches aim to provide personalized recommendations to customers based on their past behaviors or patterns observed in the dataset.

2. **Mining Historical Data:**
   - Both methods rely on historical data analysis to derive insights and make recommendations. Whether it's analyzing past purchases to predict future behavior (3A) or discovering patterns in transactional data (3B), historical data plays a crucial role.

3. **Scalability:**
   - Both approaches can be scaled to handle large datasets and can be retrained or re-mined on new data to follow changes in customer behavior, although the cost of Apriori grows quickly as the minimum support threshold is lowered.

**Limitations:**

1. **Cold Start Problem:**
   - Both approaches may face challenges when dealing with new users or items for which there is limited historical data. The cold start problem can hinder the effectiveness of recommendations until sufficient data is available.

2. **Interpretability:**
   - The logistic regression coefficients in 3A show how each feature shifts the predicted likelihood of purchase. Individual association rules in 3B are easy to read, but a large rule set (1,307 frequent itemsets were mined here) is hard to summarize, and a rule describes co-occurrence rather than the reason behind a recommendation.

3. **Data Quality and Noise:**
   - Both methods are sensitive to data quality issues and noise in the dataset. Poor-quality data or irrelevant patterns may lead to inaccurate recommendations or false associations.

In summary, while 3A and 3B employ different methodologies for generating recommendations, they share the common goal of enhancing user experience and driving engagement by providing relevant and personalized suggestions. The choice between these approaches depends on factors such as the nature of the data, the business context, and the specific requirements of the recommendation system.