## Recommendation Systems - Content-Based 
This notebook builds a **Content-Based Recommendation Model** to suggest similar magazines based on their descriptions and categories. By analyzing the content of each magazine, the model recommends items with similar attributes.  

### **Approach:**  
1. **Data Preparation**  
   - Extracted magazine **descriptions** and **categories** from the dataset.  
   - Cleaned text by removing **duplicates, punctuation, and special characters**.  
   - Performed **keyword extraction** from descriptions using the **RAKE library**.  
   - Created a **bag of words** by combining category labels and extracted keywords.  

2. **Modeling**  
   - Used **CountVectorizer** to convert the bag of words into numerical vectors.  
   - Computed **cosine similarity** between magazines to measure content similarity.  
   - Given a magazine, the model retrieves the most similar magazines based on content.  

3. **Evaluation**  
   - Compared model-generated recommendations with Amazon's actual recommendations.  
   - **2 out of 5** recommendations matched Amazon’s suggested products.  

Because it relies only on magazine descriptions and categories, this content-based approach can recommend similar items without any user-behavior data.  
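The evaluation step boils down to counting how many of the model's top-k recommendations also appear in Amazon's list for the same magazine. Below is a minimal sketch of that count. It assumes the reference list comes from a field such as `also_buy` or `also_view` (the overview does not say which), and it uses made-up placeholder ASINs, not real results. `hits_at_k` is a hypothetical helper name.

```python
def hits_at_k(recommended, reference, k=5):
    """Count how many of the first k recommended items appear in the reference list."""
    reference = set(reference)
    return sum(1 for asin in recommended[:k] if asin in reference)

# Placeholder identifiers for illustration only, not taken from the dataset
model_recs = ["M1", "M2", "M3", "M4", "M5"]
amazon_recs = ["M2", "M9", "M5", "M7"]

hits = hits_at_k(model_recs, amazon_recs, k=5)
precision_at_5 = hits / 5
print(hits, precision_at_5)
```

Dividing the hit count by k gives precision@k, which makes results comparable across magazines with different numbers of reference items.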


In [3]:
# %matplotlib inline
import numpy as np
import pandas as pd

import nltk
from sklearn.feature_extraction.text import CountVectorizer

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [81]:
pd.set_option('display.max_colwidth',50)

In [17]:
!pip install rake-nltk
from rake_nltk import Rake
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [83]:
df3 = pd.read_json('/content/drive/MyDrive/Machine Learning- Group Project/meta_Magazine_Subscriptions.json', lines=True)
print(df3.shape)
df3.head(2)

(3385, 19)


Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,"[Magazine Subscriptions, Professional & Educat...",,[REASON is edited for people interested in eco...,,"<span class=""a-size-medium a-color-secondary""","[B002PXVYLE, B01MCU84LB, B000UHI2LW, B01AKS14A...",,Reason Magazine,[],[],"[B002PXVYLE, B000UHI2LW, B01MCU84LB, B002PXW18...","{'Format:': 'Print Magazine', 'Shipping: ': 'C...",Magazine Subscriptions,,NaT,,B00005N7NQ,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,"[Magazine Subscriptions, Arts, Music &amp; Pho...",,[Written by and for musicians. Covers a variet...,,"<span class=""a-size-medium a-color-secondary""","[B002PXVYGE, B0054LRNC8, B000BVEELE, B00006KC3...",,String Letter Publishers,[],742 in Magazine Subscriptions (,"[B002PXVYGE, B0054LRNC8, B00006L16A, 171906487...","{'Format:': 'Print Magazine', 'Shipping: ': 'C...",Magazine Subscriptions,,NaT,,B00005N7OC,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


In [107]:
meta = df3[['asin','description','category']]
print(meta.shape)

(3385, 3)


In [108]:
#how many duplicate items
print("how many duplicate items: ",meta.duplicated(subset=['asin']).sum())
#dropping duplicate items (reassign instead of modifying the slice of df3 in place)
meta = meta.drop_duplicates(subset=['asin'])

how many duplicate items:  1065


In [109]:
# how many items with no categ info; removing them
print("how many items with no categ info: ",meta[meta['category'].str.len() == 0].shape) 
meta = meta[meta['category'].str.len() != 0]
print("final shape:",meta.shape)

how many items with no categ info:  (442, 3)
final shape: (1878, 3)


In [110]:
print("how many items with no desc info: ",meta[meta['description'].str.len() == 0].shape)
#remove them
meta = meta[meta['description'].str.len() != 0]
print("final shape:",meta.shape)

how many items with no desc info:  (234, 3)
final shape: (1644, 3)


In [111]:
#take out description string stored as list and save as string to new column
meta['desc2'] = meta['description'].apply(lambda x: "".join(x))

In [112]:
pd.set_option('display.max_colwidth',300)
print("how many, duplicated descriptions: ",meta.duplicated(subset=['desc2']).sum())
print("visual check:")
meta[meta.duplicated(subset=['desc2'], keep=False)].sort_values('desc2').head(4)

how many, duplicated descriptions:  154
visual check:


Unnamed: 0,asin,description,category,desc2
2360,B000LE0O74,"[<I>Selvedge</I> magazine offers fine textile photography, cutting edge reports on contemporary textile art, detailed profiles of internationally renowned artists, articles on ethnographic textiles, critical coverage of the fashion and design industry, news, book reviews, and information on exhi...","[Magazine Subscriptions, Professional & Educational Journals, Professional & Trade, Arts, Photography]","<I>Selvedge</I> magazine offers fine textile photography, cutting edge reports on contemporary textile art, detailed profiles of internationally renowned artists, articles on ethnographic textiles, critical coverage of the fashion and design industry, news, book reviews, and information on exhib..."
2181,B0009GJBD2,"[<I>Selvedge</I> magazine offers fine textile photography, cutting edge reports on contemporary textile art, detailed profiles of internationally renowned artists, articles on ethnographic textiles, critical coverage of the fashion and design industry, news, book reviews, and information on exhi...","[Magazine Subscriptions, Professional & Educational Journals, Professional & Trade, Arts, Photography]","<I>Selvedge</I> magazine offers fine textile photography, cutting edge reports on contemporary textile art, detailed profiles of internationally renowned artists, articles on ethnographic textiles, critical coverage of the fashion and design industry, news, book reviews, and information on exhib..."
3367,B01H0LP47U,"[<i>Women&#x2019;s Health</i> gives you the tools you need to get healthier, sexier, stronger, slimmer&#x2014;and stay that way, all year long! Every issue inspires you with fresh fitness secrets, mealtime makeovers, and beauty and fashion tips to bring out your inner confidence. You'll find 15-...","[Magazine Subscriptions, Sports, Recreation & Outdoors, Sports & Leisure, Training]","<i>Women&#x2019;s Health</i> gives you the tools you need to get healthier, sexier, stronger, slimmer&#x2014;and stay that way, all year long! Every issue inspires you with fresh fitness secrets, mealtime makeovers, and beauty and fashion tips to bring out your inner confidence. You'll find 15-m..."
3056,B010EJ6MSU,"[<i>Women&#x2019;s Health</i> gives you the tools you need to get healthier, sexier, stronger, slimmer&#x2014;and stay that way, all year long! Every issue inspires you with fresh fitness secrets, mealtime makeovers, and beauty and fashion tips to bring out your inner confidence. You'll find 15-...","[Magazine Subscriptions, Sports, Recreation & Outdoors, Sports & Leisure, Training]","<i>Women&#x2019;s Health</i> gives you the tools you need to get healthier, sexier, stronger, slimmer&#x2014;and stay that way, all year long! Every issue inspires you with fresh fitness secrets, mealtime makeovers, and beauty and fashion tips to bring out your inner confidence. You'll find 15-m..."


In [113]:
#drop duplicates desc
meta.drop_duplicates(subset=['desc2'],inplace=True)
print("final shape:",meta.shape)

final shape: (1490, 4)


In [114]:
#clean up the description text
import re
import string

def clean_text(text):
    
    text = text.lower()
    text = re.sub(r'<[^>]*>', ' ', text)  # removes HTML tags such as <i> and </i>
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # removes ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text) # removes any token that contains a digit
    text = re.sub('[‘’“”♪…]', '', text) # removes curly quotes, ellipses and similar symbols
    text = "".join(c for c in text if ord(c)<128)  # removes remaining non-ASCII characters
            
    return text
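To see what these substitutions do to a raw description, here is a self-contained sketch of the same steps applied to a made-up sentence. Two things are assumptions and not part of the notebook: the optional `html.unescape` call, which decodes entities such as `&#x2019;` and `&amp;` before punctuation is stripped, and the final whitespace collapse. Without the decoding step, `&amp;` survives as the token `amp`, and a word like `Women&#x2019;s` is dropped entirely because it turns into `womenx2019s`, a token that contains digits.

```python
import html
import re
import string

def clean_text_sketch(text, decode_entities=True):
    if decode_entities:
        text = html.unescape(text)  # assumption: not done in the notebook
    text = text.lower()
    text = re.sub(r'<[^>]*>', ' ', text)                             # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # tokens containing digits
    text = ''.join(c for c in text if ord(c) < 128)                  # non-ASCII characters
    return ' '.join(text.split())                                    # assumption: collapse whitespace

sample = "<i>Women&#x2019;s Health</i> gives you 15-minute workouts &amp; tips!"
with_decoding = clean_text_sketch(sample)
without_decoding = clean_text_sketch(sample, decode_entities=False)
print(with_decoding)     # womens health gives you workouts tips
print(without_decoding)  # health gives you workouts amp tips
```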

In [115]:
meta['clean_desc'] = meta['desc2'].apply(lambda x: clean_text(x))

In [116]:
#remove the first item in category list which is same for all
meta['categ2'] = meta['category'].apply(lambda x: x[1:])

In [117]:
#clean up categories to format needed for bag of words format
import re
import string

def clean_categ(categ_list):
  new_list = []
  for category in categ_list:
    text = category.lower()
    text = re.sub(' & ', '', text)  # drops the " & " separator
    text = re.sub(' &amp; ', '', text) # drops the HTML-escaped " &amp; " separator
    text = re.sub(' ', '', text) # removes spaces so each category becomes a single token
    text = re.sub(',', '', text) # removes commas
    new_list.append(text)

  return new_list
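Collapsing each category into a single lowercase token matters for the later vectorization step: a multi-word label then counts as one feature instead of being split into generic words. The sketch below re-implements the same cleaning on one label from the data and tokenizes with the regular expression that scikit-learn documents as the default `token_pattern` of `CountVectorizer`. The helper name `collapse_category` is illustrative.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def collapse_category(category):
    """Lowercase a category label and strip separators so it becomes one token."""
    text = category.lower()
    for piece in (' & ', ' &amp; ', ' ', ','):
        text = text.replace(piece, '')
    return text

label = "Arts, Music &amp; Photography"
collapsed = collapse_category(label)
tokens_collapsed = re.findall(TOKEN_PATTERN, collapsed)
tokens_raw = re.findall(TOKEN_PATTERN, label.lower())

print(collapsed)         # artsmusicphotography
print(tokens_collapsed)  # ['artsmusicphotography']
print(tokens_raw)        # ['arts', 'music', 'amp', 'photography']
```

With the raw label, a magazine in "Arts, Music & Photography" would share the token `music` with every description that mentions music, which blurs the category signal.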

In [118]:
meta['clean_categ'] = meta['categ2'].apply(lambda x: clean_categ(x))

In [119]:
pd.set_option('display.max_colwidth',100)
print(meta.shape)
meta.head(3)

(1490, 7)


Unnamed: 0,asin,description,category,desc2,clean_desc,categ2,clean_categ
0,B00005N7NQ,"[REASON is edited for people interested in economic, social, and international issues. Viewpoint...","[Magazine Subscriptions, Professional & Educational Journals, Professional & Trade, Humanities &...","REASON is edited for people interested in economic, social, and international issues. Viewpoint ...",reason is edited for people interested in economic social and international issues viewpoint str...,"[Professional & Educational Journals, Professional & Trade, Humanities & Social Sciences, Econom...","[professionaleducationaljournals, professionaltrade, humanitiessocialsciences, economicseconomic..."
1,B00005N7OC,[Written by and for musicians. Covers a variety of musical styles and includes transcriptions fr...,"[Magazine Subscriptions, Arts, Music &amp; Photography, Music]",Written by and for musicians. Covers a variety of musical styles and includes transcriptions fro...,written by and for musicians covers a variety of musical styles and includes transcriptions from...,"[Arts, Music &amp; Photography, Music]","[artsmusicphotography, music]"
2,B00005N7OD,[Allure is the beauty expert. Every issue is full of celebrity tips and insider secrets from the...,"[Magazine Subscriptions, Fashion &amp; Style, Women]",Allure is the beauty expert. Every issue is full of celebrity tips and insider secrets from the ...,allure is the beauty expert every issue is full of celebrity tips and insider secrets from the p...,"[Fashion &amp; Style, Women]","[fashionstyle, women]"


### Keyword extraction with RAKE

In [120]:
# from rake_nltk import Rake
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')


#extract keywords from description
r = Rake()

def extract_key_words(text):
    # rows yielded by iterrows() may be copies, so build the column with apply instead
    r.extract_keywords_from_text(text)
    return list(r.get_word_degrees().keys())

meta['key_words'] = meta['clean_desc'].apply(extract_key_words)
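Because only the dictionary keys are kept, this cell effectively stores the unique non-stopword words of each description; the degree values themselves are discarded. The following is a simplified sketch of the idea behind RAKE's word degrees, not the `rake_nltk` implementation. It uses a tiny hypothetical stopword list instead of NLTK's, splits the text into candidate phrases at stopwords, and gives each word a degree equal to the summed length of the phrases it appears in.

```python
from collections import defaultdict

# Tiny hypothetical stopword list for illustration; rake_nltk uses NLTK's list
STOPWORDS = {"is", "the", "for", "and", "of", "a", "in", "by"}

def candidate_phrases(text):
    """Split text into runs of consecutive non-stopwords."""
    phrases, current = [], []
    for word in text.split():
        if word in STOPWORDS:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)
    return phrases

def word_degrees(phrases):
    """Degree of a word = total length of the phrases it occurs in."""
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degree[word] += len(phrase)
    return dict(degree)

text = "written by and for musicians covers a variety of musical styles"
phrases = candidate_phrases(text)
degrees = word_degrees(phrases)
key_words = list(degrees.keys())
print(phrases)
print(key_words)
```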

In [121]:
# combine cleaned categories and keywords into one space-separated string per magazine
columns = ['clean_categ', 'key_words']
meta['bag_of_words'] = meta[columns].apply(
    lambda row: ' '.join(word for col in columns for word in row[col]), axis=1)

In [122]:
meta.head(2)

Unnamed: 0,asin,description,category,desc2,clean_desc,categ2,clean_categ,key_words,bag_of_words
0,B00005N7NQ,"[REASON is edited for people interested in economic, social, and international issues. Viewpoint...","[Magazine Subscriptions, Professional & Educational Journals, Professional & Trade, Humanities &...","REASON is edited for people interested in economic, social, and international issues. Viewpoint ...",reason is edited for people interested in economic social and international issues viewpoint str...,"[Professional & Educational Journals, Professional & Trade, Humanities & Social Sciences, Econom...","[professionaleducationaljournals, professionaltrade, humanitiessocialsciences, economicseconomic...","[reason, edited, people, interested, economic, social, international, issues, viewpoint, stresse...",professionaleducationaljournals professionaltrade humanitiessocialsciences economicseconomictheo...
1,B00005N7OC,[Written by and for musicians. Covers a variety of musical styles and includes transcriptions fr...,"[Magazine Subscriptions, Arts, Music &amp; Photography, Music]",Written by and for musicians. Covers a variety of musical styles and includes transcriptions fro...,written by and for musicians covers a variety of musical styles and includes transcriptions from...,"[Arts, Music &amp; Photography, Music]","[artsmusicphotography, music]","[written, musicians, covers, variety, musical, styles, includes, transcriptions, recordings, sol...",artsmusicphotography music written musicians covers variety musical styles includes transcriptio...


In [62]:
# meta.to_excel('/content/drive/MyDrive/Machine Learning- Group Project/data/meta_contbased_test5.xlsx')

### Content-Based Recommendation System

In [123]:
cb_df = meta[['asin','bag_of_words']]
print(cb_df.shape)
cb_df.head(3)

(1490, 2)


Unnamed: 0,asin,bag_of_words
0,B00005N7NQ,professionaleducationaljournals professionaltrade humanitiessocialsciences economicseconomictheo...
1,B00005N7OC,artsmusicphotography music written musicians covers variety musical styles includes transcriptio...
2,B00005N7OD,fashionstyle women allure beauty expert every issue full celebrity tips insider secrets pros lik...


In [125]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
count_matrix = cv.fit_transform(cb_df['bag_of_words'])
cosine_sim_mtx = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim_mtx)

[[1.         0.         0.02787473 ... 0.         0.         0.05847053]
 [0.         1.         0.09304842 ... 0.03367175 0.05241424 0.        ]
 [0.02787473 0.09304842 1.         ... 0.21931723 0.03413944 0.06356417]
 ...
 [0.         0.03367175 0.21931723 ... 1.         0.02470831 0.32203059]
 [0.         0.05241424 0.03413944 ... 0.02470831 1.         0.        ]
 [0.05847053 0.         0.06356417 ... 0.32203059 0.         1.        ]]
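Each entry of this matrix is the cosine of the angle between two magazines' word-count vectors: the dot product divided by the product of the vector lengths. Here is a dependency-free sketch with made-up bags of words. It uses plain whitespace tokenization, which differs slightly from `CountVectorizer` (for example, it keeps one-letter tokens).

```python
import math
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(bag_a.split()), Counter(bag_b.split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up bags of words for illustration
a = "fashionstyle women beauty tips"
b = "fashionstyle women fashion trends"
c = "sportsleisure hunting fishing tips"

print(cosine(a, a))  # 1.0  (identical content)
print(cosine(a, b))  # 0.5  (two of four words shared)
print(cosine(a, c))  # 0.25 (one of four words shared)
print(cosine(b, c))  # 0.0  (no words shared)
```

The diagonal of the matrix is 1 for the same reason `cosine(a, a)` is: every magazine is identical to itself.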


In [126]:
#positional index of each product asin, aligned with the rows of cosine_sim_mtx
asin_indices = pd.Series(cb_df['asin'].to_numpy())
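A lookup Series like this is used to pick rows of a NumPy matrix, and NumPy rows are addressed by position. After rows have been filtered out of a DataFrame, its index labels no longer equal row positions, so the lookup should be built on a fresh 0..n-1 index. A small sketch with made-up ASINs shows the difference:

```python
import pandas as pd

# Made-up data: four items, one of which is filtered out
df = pd.DataFrame({"asin": ["A1", "A2", "A3", "A4"]})
df = df[df["asin"] != "A2"]          # index labels are now 0, 2, 3

by_label = pd.Series(df["asin"])                 # keeps the original labels
by_position = pd.Series(df["asin"].to_numpy())   # fresh 0..n-1 index

label_idx = by_label[by_label == "A4"].index[0]
pos_idx = by_position[by_position == "A4"].index[0]

print(len(df))     # 3 rows, so a similarity matrix has rows 0..2
print(label_idx)   # 3 -> out of range for a 3x3 matrix
print(pos_idx)     # 2 -> the correct row
```

Calling `reset_index(drop=True)` on the DataFrame before building the matrix would be an equivalent way to keep labels and positions in sync.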

In [162]:
def recommend(item, cosine_sim = cosine_sim_mtx):
    # look up the query magazine's row in the similarity matrix
    idx = asin_indices[asin_indices == item].index[0]
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # skip rank 0 (normally the query itself) and keep the next 10 positions
    top_indices = list(score_series.iloc[1:11].index)
    # map matrix positions back to asins
    recommended_magazines = [cb_df['asin'].iloc[i] for i in top_indices]

    return recommended_magazines
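Slicing `iloc[1:11]` assumes the query magazine itself always sorts into rank 0. If another magazine has an identical bag of words, both score 1.0 and the query could land in the recommendations instead. Below is a small alternative sketch, not the notebook's method, that excludes the query by position. `top_k_similar` is a hypothetical helper and the scores are made up.

```python
def top_k_similar(sim_row, self_idx, k=10):
    """Return positions of the k highest scores, never including self_idx."""
    candidates = [(score, i) for i, score in enumerate(sim_row) if i != self_idx]
    candidates.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in input order
    return [i for _, i in candidates[:k]]

# Made-up similarity row for the item at position 2; position 1 is an exact duplicate
row = [0.2, 1.0, 1.0, 0.5, 0.0]
top_two = top_k_similar(row, self_idx=2, k=2)
print(top_two)  # [1, 3]
```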

#### Note: the ASIN passed to `recommend` must be one of the magazines in `cb_df`

In [129]:
#recommending top 10 magazines - Content Based
for_magazine = 'B00005N7OD'
reco_list = recommend(for_magazine)
reco_list

['B01H6WOM4O',
 'B00005N7QN',
 'B000K0YFVU',
 'B00JARAU4A',
 'B00007M2OH',
 'B000EU1H3U',
 'B0007INI2C',
 'B00NI7UQDI',
 'B00007AZEO',
 'B019INBTAE']

In [133]:
reco_list.insert(0,for_magazine)
reco_list

['B00005N7OD',
 'B01H6WOM4O',
 'B00005N7QN',
 'B000K0YFVU',
 'B00JARAU4A',
 'B00007M2OH',
 'B000EU1H3U',
 'B0007INI2C',
 'B00NI7UQDI',
 'B00007AZEO',
 'B019INBTAE']

In [145]:
pd.set_option('display.max_colwidth',30)
check_df = meta[meta['asin'].isin(reco_list)]
check_df = check_df[['asin','description','clean_categ']].head(5)
check_df

Unnamed: 0,asin,description,clean_categ
2,B00005N7OD,[Allure is the beauty expe...,"[fashionstyle, women]"
50,B00005N7QN,"[Harper's BAZAAR, the fash...","[fashionstyle, women]"
768,B00007AZEO,[Marie Claire Idees focuse...,"[fashionstyle, women]"
883,B00007M2OH,[Glamour UK is Britain s n...,"[fashionstyle, international]"
2159,B0007INI2C,[New Beauty is the first p...,[fashionstyle]


In [161]:
for_magazine2 = 'B00005N7Q5'
reco_list2 = recommend(for_magazine2)
reco_list2

['B000BW56WO',
 'B00H287D44',
 'B000CR6VFE',
 'B000089G4V',
 'B000ILVZJ6',
 'B00006KNWA',
 'B00EV5HJ5E',
 'B00YQH98G0',
 'B00005N7WA',
 'B00005N7WM']

In [163]:
reco_list2.insert(0,for_magazine2)

In [167]:
pd.set_option('display.max_colwidth',50)
check_df = meta[meta['asin'].isin(reco_list2)]
check_df = check_df[['asin','description','clean_categ']]
check_df

Unnamed: 0,asin,description,clean_categ
9,B00005N7Q5,"[America's Number One sportsman's magazine, fe...","[sportsrecreationoutdoors, sportsleisure, hunt..."
76,B00005N7WA,"[The ""Voice of Marine Fishing"", covering ocean...","[sportsrecreationoutdoors, sportsleisure, hunt..."
97,B00005N7WM,"[How-to, seasonal editorial and tournaments fo...","[sportsrecreationoutdoors, sportsleisure, hunt..."
458,B00006KNWA,[Designed for the outdoorsman who participates...,"[sportsrecreationoutdoors, sportsleisure, hunt..."
904,B000089G4V,[A sportsman's guide to the best hunting and f...,"[sportsrecreationoutdoors, sportsleisure, hunt..."
2228,B000BW56WO,"[For the outdoor enthusiast, containing up-to-...","[sportsrecreationoutdoors, sportsleisure, hiki..."
2236,B000CR6VFE,[ATV WORLD Magazine publishes FOUR complete is...,"[sportsrecreationoutdoors, sportsleisure, extr..."
2318,B000ILVZJ6,[Each issue of Petersen's Hunting Magazine has...,"[sportsrecreationoutdoors, sportsleisure, hunt..."
2703,B00EV5HJ5E,[A sportsman's guide to hunting and fishing in...,"[sportsrecreationoutdoors, sportsleisure, hunt..."
2733,B00H287D44,"[A fishing, hunting and camping guide to Wisco...","[sportsrecreationoutdoors, sportsleisure, hunt..."


In [155]:
pd.set_option('display.max_colwidth',300)
cb_df.head(15)

Unnamed: 0,asin,bag_of_words
0,B00005N7NQ,professionaleducationaljournals professionaltrade humanitiessocialsciences economicseconomictheory reason edited people interested economic social international issues viewpoint stresses individual liberty private responsibility limited government emphasis pacific rim localstate national impact ...
1,B00005N7OC,artsmusicphotography music written musicians covers variety musical styles includes transcriptions recordings solo pieces guitar
2,B00005N7OD,fashionstyle women allure beauty expert every issue full celebrity tips insider secrets pros like works overnight lifetime editors pick favorite new products reveal styles really work subscription includes annual special issues makeovers best
3,B00005N7O9,sportsrecreationoutdoors sportsleisure flying flight journal includes articles aviation history contemporary practice military commercial civil aircraft notable personalities technical aspects visits air shows museums reports restorations photography threeview drawings art coverage virtual exper...
4,B00005N7O6,professionaleducationaljournals professionaltrade transportation rider published road street riding motorcycle enthusiastthe enjoys touring sport accent performance week ending also may use machine commuting magazine includes equipment accessory apparel evaluations cycle related travel adventure...
5,B00005N7P0,technology computersinternet maximum pc ultimate upgrade savvy owners every month magazine packed breaking news tons tips amp techniques indepth reviews anywheredesigned rabid hobbyist brings written irreverent edgy style full disclosure modus operandi theres almost overwhelming amount tech spec...
6,B00005N7QG,cookingfoodwine recipestechniques good housekeeping magazine together institute seal american icon consumer protection quality assurance every issue delivers unique mix independent investigation trusted reporting along inspirational personal stories magazines rich tradition embodies commitment m...
7,B00005N7PI,professionaleducationaljournals professionaltrade medicine get expert unbiased reporting live healthier longer discover works doesnt prescription drugs alternative therapies disease prevention nutrition name trust outside advertising
8,B00005N7OP,artsmusicphotography artarthistory magazine reports personalities trends events shape international art world articles focus ranging old masters contemporary genres regular features include reviews books exhibits travel destinations investment appreciation advice insights
9,B00005N7Q5,sportsrecreationoutdoors sportsleisure huntingfirearms americas number one sportsmans magazine featuring indepth articles hunting fishing outdoor adventure conservation news firstclass fiction field stream editorial excellence years
