Recommendation algorithm for businesses ordering advertising from bloggers.

Plan: implement a recommendation algorithm based on comparing the topics of authors' texts with the companies' fields of activity.
Conclusions will be based on the pairwise cosine similarity between the vectors describing a company's industry and the vectors describing the topics of the authors' texts.
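
As a reminder of what the score measures, the sketch below computes the cosine similarity between small count vectors in plain Python. The vectors and the function name `cosine_sim` are illustrative assumptions, not part of the notebook, which relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term counts over the vocabulary ["mobile", "browser", "travel"]
post = [2, 1, 0]
software_company = [1, 1, 0]
travel_company = [0, 0, 3]

sim_software = cosine_sim(post, software_company)  # shares terms with the post
sim_travel = cosine_sim(post, travel_company)      # shares no terms with the post
```

Vectors that share no terms score 0, and the score approaches 1 as the term distributions align.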

In [150]:
# import of necessary libraries
import numpy as np
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline  
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import re
import string
import nltk
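nltk.download('stopwords')
nltk.download('genesis')
# NOTE: WordNetLemmatizer reads the 'wordnet' corpus and word_tokenize needs the punkt tokenizer data;
# neither is downloaded here, so both are assumed to be installed already.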
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
stop_words = stopwords.words('english')
more_stopwords = ['u', 'im', 'c']
stop_words = stop_words + more_stopwords
stemmer = nltk.SnowballStemmer("english")
# !pip install pyspellchecker
# from spellchecker import SpellChecker
# spell = SpellChecker()
wordnet_lemmatizer = WordNetLemmatizer()
sns.set_theme(context='notebook', style='whitegrid', palette='deep', font='sans-serif', font_scale=1, color_codes=True, rc=None)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package genesis to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package genesis is already up-to-date!


In [151]:
# helper functions
def correct_spellings(text):
    """Replace misspelled words; needs the `spell` object from the commented-out pyspellchecker setup."""
    words = text.split()
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        if word in misspelled_words:
            corrected_text.append(spell.correction(word))
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove punctuation
    and remove words containing numbers.'''
    text = str(text).lower()
    # as written, this only matches a bracketed span preceded by `" ; r'\[.*?\]' would match every one
    text = re.sub(r'`"\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # links
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    # punctuation (commas included) is already gone, so split(',') returns the whole text as
    # a single chunk: stop words stay in place and the lemmatizer sees one long "word"
    text = ' '.join(word for word in text.split(',') if word not in stop_words)
#     text = ' '.join(stemmer.stem(word) for word in text.split(' '))
    text = ' '.join(wordnet_lemmatizer.lemmatize(word) for word in text.split(','))
#     text = correct_spellings(text)
    return text

def create_corpus(df, column='corp_clean'):
    """Tokenize every text in `column` into a list of words (not used further in this notebook)."""
    corpus = []
    for text in tqdm(df[column]):
        corpus.append(word_tokenize(text))
    return corpus

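def get_recommendation(top, df, scores, author_id=7):
    """Build a ranked table: `top` holds company row indices sorted by score, `scores` the matching scores."""
    recommendation = pd.DataFrame(columns=['Author', 'Name', 'score'])
    for count, i in enumerate(top):
        recommendation.at[count, 'Author'] = author_id  # defaults to the author chosen in this notebook
        recommendation.at[count, 'Name'] = df['Name'][i]
        recommendation.at[count, 'score'] = scores[count]
    return recommendation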
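
Because `clean_text` strips punctuation before it splits on commas, its stop-word filter receives the whole text as one chunk. A minimal sketch of whitespace-based filtering follows; the tiny stop-word set and the name `remove_stopwords` are assumptions for illustration, standing in for the NLTK list used in the notebook.

```python
# Small stand-in for stopwords.words('english'); an assumption for this sketch
STOP_WORDS = {"the", "is", "in", "to", "you", "will", "it", "where"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop stop words; repeated spaces collapse as a side effect."""
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = "google will show you where to vote"
comma_split = cleaned.split(",")      # no commas left, so this is a one-element list
filtered = remove_stopwords(cleaned)  # whitespace split sees individual words
```

Switching the notebook to this kind of tokenization would change the cleaned texts and therefore the scores shown in the recorded outputs.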

In [152]:
# load the posts file and rename its columns
df1 = pd.read_csv('posts.csv') 
df1.rename(columns = {"Blogger's ID":'Author ID',"Number of retrieved inlinks":'links'}, inplace = True)
df1.rename(columns = {"Number of retrieved comments":'comments',"Number of comments":'commentsNum'}, inplace = True)

# combine title and content into a single text field
df1['corp'] = df1.Title.fillna('') + " " + df1.Content.fillna('')
df1 = df1.drop(['Title','Content'], axis=1)
# clean text
# df1 = df1.set_index('Author ID')
df1['corp_clean'] = df1['corp'].apply(clean_text)
df1 = df1[['Author ID','corp_clean']]
df1.sample()

Unnamed: 0,Author ID,corp_clean
10811,83,google will show you where to vote it looks li...


For this project I pick the author with ID 7 and their 124th publication (row index 123):

In [167]:
sample = df1.iloc[[123]]
sample

Unnamed: 0,Author ID,corp_clean
123,7,opera mini sees million mobile users in febru...


Full cleaned text of the publication:

In [154]:
sample.corp_clean.values[0]

'opera mini sees  million mobile users in february up  percent browser maker opera software has released its latest  of the mobile  report this morning which is based on the usage of its opera mini browser for mobile phones each month the conclusion is always the same mobile web usage around the world keeps on growing and growing in february opera mini had over  million users a  increase from january  and more than  increase compared to february  says that the  million plus users viewed more than  billion pages in february which actually represents a  decrease from january opera claims this is because february only has  days compared to january’s  since february  page views have increased by  in february opera mini users generated over  million mb of data with consumption down by  since february  data traffic is up over  the top  countries for opera mini usage in february remained the same with users  mainly centralized in russia indonesia india china ukraine south africa nigeria the u

Vectorizing the author's text:

In [155]:
# initialize the TF-IDF vectorizer with uni-, bi- and trigrams
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,3))
tfidf_autor = tfidf_vectorizer.fit_transform(sample['corp_clean'])  # fit on the selected post only, then transform it
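
Since the vectorizer is fitted on a single post, its vocabulary contains only that post's n-grams, and any other text transformed with it keeps only the terms it shares with the post. The sketch below illustrates this on made-up strings; the texts are assumptions, not rows from `posts.csv` or `companies.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts standing in for one cleaned post and two industry labels
post = ["opera mini mobile browser users mobile web usage"]
industries = ["mobile software", "leisure travel tourism"]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
post_vec = vectorizer.fit_transform(post)          # vocabulary comes from the post only
industry_vecs = vectorizer.transform(industries)   # terms absent from the post are dropped

# Number of non-zero features that survive for each industry label
nonzero_per_industry = (industry_vecs.toarray() != 0).sum(axis=1)
```

An industry label with no surviving terms becomes an all-zero vector, which is why many companies end up with a score of exactly 0.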


Loading the companies dataset and selecting the feature that describes each company's activity for further work:

In [156]:
df = pd.read_csv('companies.csv')
# df = df.set_index('Name')
df = df[['Name','Industry']]
df['Industry'] = df['Industry'].apply(clean_text)
df.sample(1)

Unnamed: 0,Name,Industry
735,not a clue adventures,leisure travel tourism


Vectorizing the companies' industry descriptions and building the matrix of cosine similarities between the author vector and the company vectors:

In [157]:
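from sklearn.metrics.pairwise import cosine_similarity
tfidf_company = tfidf_vectorizer.transform(df['Industry'])  # reuse the vocabulary fitted on the post
# one (n_companies, 1) similarity array per author row; nothing is computed until list() is called
cos_similarity_tfidf = map(lambda x: cosine_similarity(tfidf_company, x),tfidf_autor)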

In [160]:
output = list(cos_similarity_tfidf)


The result is a sorted data frame with the cosine similarity (score) between this author and each of the listed companies.
The higher the score, the closer the topic of the author's publication is to the company's field of activity.

A minimum threshold for this score, below which candidates are rejected, should later be established empirically.
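
One way such a threshold could be applied is sketched below with NumPy. The company names, the scores and the cut-off `min_score = 0.1` are all assumptions made up for the example; the actual value still has to be found empirically.

```python
import numpy as np

# Hypothetical similarity scores for five companies (not taken from the notebook's data)
names = np.array(["a", "b", "c", "d", "e"])
scores = np.array([0.0, 0.122, 0.182, 0.0, 0.05])
min_score = 0.1  # assumed cut-off for illustration only

order = np.argsort(scores)[::-1]          # indices from highest to lowest score
keep = order[scores[order] >= min_score]  # drop candidates below the threshold
shortlist = list(zip(names[keep].tolist(), scores[keep].tolist()))
```

Here only companies `c` and `b` remain, in descending order of score.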

In [182]:
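# company row indices ordered by similarity to the post, highest first
top = sorted(range(len(output[0])), key=lambda i: output[0][i], reverse=True)
list_scores = [output[0][i][0] for i in top]
# display five random rows of the ranked table
get_recommendation(top,df, list_scores).sample(5)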

Unnamed: 0,Author,Name,score
678,7,jwe corporation,0.0
123,7,xceed systems,0.122426
63,7,delhi recuiters pvt. ltd.,0.181756
351,7,"medvensys, llc",0.0
343,7,vitalia salud,0.0


Conclusion: applying this approach to the group of bloggers from class 2 (the blogger clustering step) makes it possible to select the companies to which these bloggers can be recommended for advertising publications.
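
Extending the approach from one author to a group amounts to computing a similarity matrix with one row per author and one column per company. The sketch below shows the idea on small dense arrays; the function name `cosine_matrix` and the toy vectors are assumptions, and in the notebook's setting scikit-learn's `cosine_similarity` on TF-IDF matrices would play this role.

```python
import numpy as np

def cosine_matrix(authors, companies):
    """Cosine similarity of every author row against every company row."""
    a_norm = np.linalg.norm(authors, axis=1, keepdims=True)
    c_norm = np.linalg.norm(companies, axis=1, keepdims=True)
    # Replace zero norms with 1 so all-zero rows yield similarity 0 instead of NaN
    a_unit = authors / np.where(a_norm == 0, 1.0, a_norm)
    c_unit = companies / np.where(c_norm == 0, 1.0, c_norm)
    return a_unit @ c_unit.T

# Hypothetical feature vectors: two authors, three companies (the last one all zeros)
authors = np.array([[2.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
companies = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 0.0]])

sim = cosine_matrix(authors, companies)        # shape: (n_authors, n_companies)
best_company_per_author = sim.argmax(axis=1)   # index of the closest company per author
```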