#### Step 1: Train the engine

Create a TF-IDF matrix of unigrams, bigrams, and trigrams from each post's title. The 'stop_words' parameter tells the TF-IDF vectorizer to ignore common English words such as 'the'.
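To make the n-gram idea concrete, here is a minimal pure-Python sketch of the features a title breaks into. It is not scikit-learn's implementation: `word_ngrams` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the library's built-in English list is much longer), and the sketch assumes that stop words are removed before the n-grams are built.

```python
import re

# Tiny illustrative stop list (an assumption); scikit-learn's 'english' list is far longer.
STOP_WORDS = {"the", "in", "on", "a", "of", "my", "so", "i", "have", "what"}


def word_ngrams(text, ngram_range=(1, 3), stop_words=STOP_WORDS):
    """Return the word n-grams of `text` after lowercasing and stop-word removal."""
    # Keep tokens made of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    tokens = [t for t in tokens if t not in stop_words]

    smallest, largest = ngram_range
    grams = []
    for n in range(smallest, largest + 1):
        # Slide a window of n tokens across the remaining words
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("My Journey on HTML"))
print(word_ngrams("Curriculum Vitae in HTML"))
```

Each distinct n-gram becomes one column of the TF-IDF matrix, so short titles produce very sparse rows.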

Then we compute the similarity between all posts using scikit-learn's linear_kernel, which is equivalent to cosine similarity here because TfidfVectorizer L2-normalizes each row by default.
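The equivalence is easy to check by hand. The sketch below uses plain Python lists rather than the real TF-IDF matrix; the helper names (`l2_normalize`, `dot`, `cosine`) and the two vectors are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Plain dot product, which is what a linear kernel computes for one pair."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

# Once both vectors have unit length, the dot product is the cosine similarity
print(cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

Both lines print the same value up to floating-point rounding, which is why the cheaper dot product is enough for normalized TF-IDF rows.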

Iterate through each post's similarity scores and store the most similar posts, just under 100 per post with the slice used below. You could keep more or fewer; the cutoff is your choice.

Similar posts and their scores are stored in a dictionary keyed by post ID; each value is a list of (score, post_id) tuples.
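The shape of that dictionary can be sketched on a toy example. Everything here is hypothetical: the three post IDs, the 3x3 similarity matrix, and the `top_similar` helper, which uses `heapq.nlargest` instead of the argsort slice in the real cell.

```python
import heapq

# Hypothetical data: three posts and their pairwise similarity scores
post_ids = [10, 20, 30]
similarities = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]


def top_similar(similarities, post_ids, top_n):
    """Map each post ID to its top_n (score, post_id) neighbours, best first."""
    results = {}
    for row_idx, row in enumerate(similarities):
        # Pair every other post with its score, leaving out the post itself
        candidates = [(score, post_ids[col])
                      for col, score in enumerate(row) if col != row_idx]
        results[post_ids[row_idx]] = heapq.nlargest(top_n, candidates)
    return results


results = top_similar(similarities, post_ids, top_n=2)
print(results[10])
```

Looking up a post ID then costs a single dictionary access, which is all the prediction step has to do.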

In [None]:
# Install the MySQL connector if it is missing (used by mysql.connector and by the SQLAlchemy engine)
# ! pip install mysql-connector-python

In [1]:
import mysql.connector
from sqlalchemy import create_engine

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [4]:
#loading the dataset
mydb = mysql.connector.connect(host="remotemysql.com",
                              user="8SawWhnha4",
                              passwd="zFvOBIqbIz",
                              database="8SawWhnha4")

engine = create_engine('mysql+mysqlconnector://8SawWhnha4:zFvOBIqbIz@remotemysql.com/8SawWhnha4')
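Credentials pasted into a notebook travel with every copy of it. One possible alternative, shown here only as a sketch, is to read them from environment variables. The variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`) and the `build_mysql_url` helper are assumptions, not something the original notebook defines.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Build a SQLAlchemy-style MySQL URL from (hypothetical) environment variables."""
    user = env["DB_USER"]
    # Escape characters such as '@' or '/' that would otherwise break the URL
    password = quote_plus(env["DB_PASSWORD"])
    host = env["DB_HOST"]
    database = env["DB_NAME"]
    return f"mysql+mysqlconnector://{user}:{password}@{host}/{database}"


# Example with a plain dict standing in for the real environment
print(build_mysql_url({"DB_USER": "demo", "DB_PASSWORD": "p@ss",
                       "DB_HOST": "localhost", "DB_NAME": "blog"}))
```

The returned string could then be passed to `create_engine` in place of the literal URL.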

In [5]:
# List the tables in the database
dbcursor = mydb.cursor()
dbcursor.execute('show tables')
for table in dbcursor:
    print(table)

('comments',)
('contact_settings',)
('ext_feed_banks',)
('ext_rsses',)
('extfeeds',)
('following',)
('interests',)
('maillists',)
('migrations',)
('notifications',)
('password_resets',)
('posts',)
('thoughts',)
('user_settings',)
('users',)
('users_email_login_tokens',)


In [7]:
# Load the posts table and keep only the id, title and content columns
POSTS = pd.read_sql_query('select * from posts', engine)
POSTS.drop(['user_id', 'tags', 'slug', 'created_at', 'updated_at', 'image',
            'status_id', 'action', 'post_id'], axis=1, inplace=True)
POSTS.head()

Unnamed: 0,id,title,content
0,1,What i have learnt so far on HTML,I learnt how to use the table tag as i have us...
1,2,HTML BEGINS HERE,"I am on this journey with start.ng, and here ..."
2,4,My Laziness In The Open,I have not been attending classes on the HNG c...
3,6,MY TASK 2,My journey on **StartNG** pre-internship progr...
4,7,Task 2,"A Summary on The “idongesit.html” CV, Its Str..."


In [14]:
POSTS.rename(columns={"id":"post_id"}, inplace=True)

In [26]:
POSTS.head(50)

Unnamed: 0,post_id,title,content
0,1,What i have learnt so far on HTML,I learnt how to use the table tag as i have us...
1,2,HTML BEGINS HERE,"I am on this journey with start.ng, and here ..."
2,4,My Laziness In The Open,I have not been attending classes on the HNG c...
3,6,MY TASK 2,My journey on **StartNG** pre-internship progr...
4,7,Task 2,"A Summary on The “idongesit.html” CV, Its Str..."
5,8,My Journey on HTML,Using the Hyper Text Markup Language (HTML) ha...
6,9,StartNG HTML Exposition,![](/storage/2040/images/img-kf6sy3kvg0.png)![...
7,11,MY TASK 2,My journey on **StartNG** pre-internship progr...
8,12,StartNG HTML task,<p> </p>\n<p>I have learned a lot about HTML a...
9,13,On StartNG Pre-Internship,**What I have learned so far**\n\nI have learn...


In [27]:
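# Vectorize the post titles as TF-IDF weighted unigrams, bigrams and trigrams.
# min_df=1 (the default) keeps every term; newer scikit-learn versions validate
# parameters and may reject an integer min_df of 0.
TF = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
TFIDF_MATRIX = TF.fit_transform(POSTS['title'])

# Rows are L2-normalized, so the dot product equals cosine similarity.
COSINE_SIMILARITIES = linear_kernel(TFIDF_MATRIX, TFIDF_MATRIX)

RESULTS = {}

for idx, row in POSTS.iterrows():
    # Indices of the 99 highest-scoring posts, best first.
    similar_indices = COSINE_SIMILARITIES[idx].argsort()[:-100:-1]

    # Skip the post itself by index rather than by position: posts with identical
    # titles tie at the top score, so the post is not guaranteed to come first.
    # Each dictionary entry is like: [(1,2), (3,4)], with each tuple being (score, post_id)
    RESULTS[row['post_id']] = [(COSINE_SIMILARITIES[idx][i], POSTS['post_id'][i])
                               for i in similar_indices if i != idx]

print('done!')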

done!


#### Step 2: Predict!

In [28]:
def post(p_id):
    """Return the title of the post with the given post ID."""
    return POSTS[POSTS['post_id'] == p_id]['title'].tolist()[0].split(' - ')[0]


def recommend(post_id, num):
    """Print the top `num` recommendations stored in RESULTS for a post."""
    print("Recommending " + str(num) + " articles similar to " + post(post_id) + "...")
    print("--------------------------------------------------------------------------")
    recs = RESULTS[post_id][:num]
    for rec in recs:
        print("Recommended: " + post(rec[1]) + " (score:" + str(rec[0]) + ")")

# Plug in any post ID (about 800 posts in the dataset) and the number of recommendations you want.
# RESULTS stores about 98 neighbours per post, so larger values of num return nothing extra.
# You can get a list of valid post IDs by evaluating POSTS['post_id']


In [30]:
recommend(post_id=33, num=5)

Recommending 5 articles similar to CURRICULUM VITAE IN HTML...
--------------------------------------------------------------------------
Recommended: curriculum vitae (score:0.6995945842991993)
Recommended: curriculum vitae (score:0.6995945842991993)
Recommended: curriculum vitae (score:0.6995945842991993)
Recommended: My Curriculum Vitae (score:0.6995945842991993)
Recommended: curriculum vitae (score:0.6995945842991993)
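
The output above repeats near-identical titles with the same score. A possible variant, sketched here under assumptions, returns the recommendations as data and can skip repeated titles. `recommend_titles`, `demo_results` and `demo_titles` are hypothetical; the two dictionaries stand in for RESULTS and a post_id-to-title lookup, and their scores are made up.

```python
def recommend_titles(results, titles, post_id, num, unique_titles=False):
    """Return up to `num` (title, score) pairs for `post_id`, best first."""
    seen = set()
    picked = []
    # Unknown post IDs simply yield an empty list
    for score, rec_id in results.get(post_id, []):
        if len(picked) >= num:
            break
        title = titles[rec_id]
        key = title.strip().lower()
        # Optionally drop titles that differ only by case or surrounding spaces
        if unique_titles and key in seen:
            continue
        seen.add(key)
        picked.append((title, score))
    return picked


# Hypothetical stand-ins for RESULTS and the title lookup
demo_results = {33: [(0.7, 40), (0.7, 41), (0.7, 42), (0.5, 43)]}
demo_titles = {40: "curriculum vitae", 41: "curriculum vitae",
               42: "My Curriculum Vitae", 43: "HTML CV"}

print(recommend_titles(demo_results, demo_titles, 33, 3))
print(recommend_titles(demo_results, demo_titles, 33, 3, unique_titles=True))
```

Returning a list keeps the lookup reusable outside the notebook, for example from a web endpoint, while printing stays a one-line loop over the result.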
