# Article 7 - Blog Recommendation

## Objective
For this article, the goal is to put into practice only the material from Lesson 7, which covers collaborative filtering and is available in two parts: [Part 1](https://course.fast.ai/Lessons/lesson7.html) (from the middle to the end) and [Part 2](https://course.fast.ai/Lessons/lesson8.html) (from the beginning to the middle). The task is to use tabular data to understand user behavior and, from that, work out the best recommendations for each user. In this specific case, the aim is to recommend blogs that will interest users, using the data in [Blog Recommendation Data](https://www.kaggle.com/datasets/yakshshah/blog-recommendation-data/data). At the end, the result is exported as a model made available on Hugging Face.

## Motivation
The habit of reading is essential for people's intellectual growth, and it becomes more enjoyable when readers come across content that interests them. A model capable of making such recommendations helps people find content they like, so they can keep reading with pleasure.

## Requirements
This task requires a few libraries for manipulating the data and for building the model. All of them are imported in cell 01.

In [1]:
from fastai.collab import *      # Collab module for collaborative filtering with fastai
from fastai.tabular.all import * # Tabular module for working with tabular data

from pathlib import Path # Path for handling directories

import numpy as np  # NumPy for numerical computation
import pandas as pd # pandas for creating and manipulating DataFrames

set_seed(42) # Fixed seed so this notebook can be reproduced



The data is also required. It was downloaded from [Blog Recommendation Data](https://www.kaggle.com/datasets/yakshshah/blog-recommendation-data/data) and placed in the input directory so it can be used. These steps can be seen in cell 02.

In [2]:
base = Path('../input/data-blog')

author_blog = pd.read_csv(base/'Author_Data.csv')
rating_blog = pd.read_csv(base/'Blog_Ratings.csv')
conten_blog = pd.read_csv(base/'Medium_Blog_Data.csv')

The ratings in this data follow this scale:
* 0.5 - The user only read the blog
* 2 - The user liked the blog
* 3.5 - The user added the blog to favorites
* 5 - The user liked the blog and added it to favorites

## Data
Now that the data has been read, the first step is to look at it for an initial analysis, to understand how the information is laid out and to decide what to use from each table. This can be seen in cells 03 to 05.

In [3]:
author_blog

Unnamed: 0,author_id,author_name
0,1,yaksh
1,2,XIT
2,3,Daniel Meyer
3,4,Seedify Fund
4,5,Ifedolapo Shiloh Olotu
...,...,...
6863,6864,Fresh Frontend Links
6864,6865,Mukesh buwade
6865,6866,Osei Owusu
6866,6867,Yasas Sandeepa


In [4]:
rating_blog

Unnamed: 0,blog_id,userId,ratings
0,9025,11,3.5
1,9320,11,5.0
2,9246,11,3.5
3,9431,11,5.0
4,875,11,2.0
...,...,...,...
200135,6714,22,5.0
200136,6576,22,3.5
200137,6222,22,3.5
200138,6015,22,2.0


In [5]:
conten_blog

Unnamed: 0,blog_id,author_id,blog_title,blog_content,blog_link,blog_img,topic,scrape_time
0,1,4,Let’s Dominate The Launchpad Space Again,"Hello, fam! If you’ve been with us since 2021, you probably remember our first announcements regarding the strategies to dominate the launchpad space in the previous bull. To recall it once more, it was (1) first upgrading our launchpad tier system and then (2) going full-deep into blockchain gaming while…",https://medium.com/@seedifyfund/lets-dominate-the-launchpad-space-again-7155875002f3?source=topics_v2---------0-84--------------------99dddee2_0334_40c4_901c_47e0e6948b6c-------17,https://miro.medium.com/fit/c/140/140/1*nByLJrDhJHndW_wRv4k3JA.png,ai,2023-02-27 07:37:48
1,3,4,Let’s Dominate The Launchpad Space Again,"Hello, fam! If you’ve been with us since 2021, you probably remember our first announcements regarding the strategies to dominate the launchpad space in the previous bull. To recall it once more, it was (1) first upgrading our launchpad tier system and then (2) going full-deep into blockchain gaming while…",https://medium.com/@seedifyfund/lets-dominate-the-launchpad-space-again-7155875002f3?source=topics_v2---------0-84--------------------99dddee2_0334_40c4_901c_47e0e6948b6c-------17,https://miro.medium.com/fit/c/140/140/1*nByLJrDhJHndW_wRv4k3JA.png,ai,2023-02-27 07:41:47
2,4,7,Using ChatGPT for User Research,"Applying AI to 4 common user research activities — User research is a fundamental part of the design process. The more time and energy a product team invests in user research, the higher the chances of releasing a commercially successful product. In this article, I will explore whether ChatGPT can be helpful for user and market research. To make…",https://medium.com/ux-planet/using-chatgpt-for-user-research-5c3bdf7e26af?source=topics_v2---------1-84--------------------99dddee2_0334_40c4_901c_47e0e6948b6c-------17,https://miro.medium.com/fit/c/140/140/1*TZSGnNza4YgHJJ4yeKdVUw.png,ai,2023-02-27 07:41:47
3,5,8,"The Automated Stable-Diffusion Checkpoint Merger, autoMBW.","Checkpoint merging is powerful. The power of checkpoint merging is undeniable. With the introduction of bbc-mc’s sdweb-merge-block-weighted-gui extension, the potential for checkpoint merging has increased exponentially in comparison to older methods. In fact, this method is so powerful that most of all modern merged models incorporate MBW merges in their…",https://medium.com/@media_97267/the-automated-stable-diffusion-checkpoint-merger-autombw-44f8dfd38871?source=topics_v2---------2-84--------------------99dddee2_0334_40c4_901c_47e0e6948b6c-------17,https://miro.medium.com/fit/c/140/140/1*x3N_Hjgu_MjFyc6-kyxUxw.png,ai,2023-02-27 07:41:47
4,6,9,The Art of Lazy Creativity: My Experience Co-Writing a Monty Python Style Sketch with AI,"I was feeling particularly lazy one day and couldn’t be bothered to write anything. So, I turned to ChatGPT, the AI language model, to help me out. But even that was too much effort, so I asked ChatGPT to write this article too. And boy, did it deliver! So, here’s…",https://medium.com/@digitalshedmedia/the-art-of-lazy-creativity-my-experience-co-writing-a-monty-python-style-sketch-with-ai-869bf5ff6a06?source=topics_v2---------3-84--------------------99dddee2_0334_40c4_901c_47e0e6948b6c-------17,https://miro.medium.com/fit/c/140/140/0*m2DdeTvRYz8Cpor2,ai,2023-02-27 07:41:47
...,...,...,...,...,...,...,...,...
10462,10489,6867,Introducing Qwik — A Superfast JavaScript Framework,"An overview of Qwik’s key features and architecture — As you may be aware, numerous JavaScript frameworks have emerged rapidly in the past few years. But this one brings an entirely new rendering paradigm to the table. It’s none other than Qwik. Qwik is as it sounds, super quick. It claims the fastest front-end framework right now. It offers…",https://medium.com/gitconnected/introducing-qwik-a-superfast-javascript-framework-419509a0ca65?source=topics_v2---------204-84--------------------fe2372ea_0f34_4c17_ab3d_2c656faeda5c-------17,https://miro.medium.com/v2/resize:fill:140:140/1*sb72ov-4tO7WTOwbzDipAA.png,web-development,2023-05-08 10:36:42
10463,10490,2490,A Beginner’s Guide to Cypress Testing Framework with a Weather Application,"Cypress is a popular JavaScript testing framework for end-to-end testing of web applications. If you have some experience with unit testing using React Testing Library, you may wonder how to test your app’s interaction with the browser in a real-world scenario. Cypress provides a solution by allowing you to write…",https://medium.com/itnext/a-beginners-guide-to-cypress-testing-framework-with-a-weather-application-88dbf6ddae6c?source=topics_v2---------205-84--------------------fe2372ea_0f34_4c17_ab3d_2c656faeda5c-------17,https://miro.medium.com/v2/resize:fill:140:140/0*hKvrIS2z_K5Fk7U5,web-development,2023-05-08 10:36:42
10464,10491,6810,How To Use Awaited in TypeScript,"Unraveling the magic of Promise handling with TypeScript’s Awaited utility type — Hello there, fellow TypeScript savants! 🎩🐇 In this article, we’re going to unpack the Awaited utility type introduced in TypeScript 4.5. As the name suggests, Awaited helps us handle Promises in TypeScript, specifically mimicking the await keyword’s ability to recursively unwrap Promises. …",https://medium.com/totally-typescript/how-to-use-awaited-in-typescript-66c340dfa491?source=topics_v2---------206-84--------------------fe2372ea_0f34_4c17_ab3d_2c656faeda5c-------17,https://miro.medium.com/v2/resize:fill:140:140/g:fp:0.45:0.54/0*RAPyHzu8Rc19f3m0,web-development,2023-05-08 10:36:42
10465,10492,6868,Angular Hydration? WTF is it and why does it matter?,"How thirsty is your app for better performance? — Angular hydration. At some point, someone might have mentioned it to you in your journey down the long windy road of Angular mastery. In a nutshell, this process restores a server-side rendered application on the client, and improves performance by avoiding extra work to re-create DOM nodes. However, while Angular…",https://medium.com/@PurpleGreenLemon/angular-hydration-wtf-is-it-and-why-does-it-matter-9152d55547c0?source=topics_v2---------207-84--------------------fe2372ea_0f34_4c17_ab3d_2c656faeda5c-------17,https://miro.medium.com/v2/resize:fill:140:140/1*vm65O-xxgnz3zIhsB-gEQA.jpeg,web-development,2023-05-08 10:36:42


Since the aim is to make recommendations based on ratings, the DataFrames contain some information that is not needed to create and train the desired model. From these 3 DataFrames, a single one can therefore be built with the following data:
* User: The user ID
* Blog: The blog ID
* Rating: The rating given to the blog
* Title: The blog title

This already indicates that the author information is not needed for training, but it is kept so that the returned result can be more complete, with the blog title and its author.

The process of building a DataFrame that covers all of this data can be seen in cell 06.

In [6]:
blogs = rating_blog.copy()
blogs = blogs.merge(conten_blog.drop(['blog_content', 'blog_link', 'blog_img', 'topic', 'scrape_time'], axis=1))
blogs = blogs.merge(author_blog)

blogs

Unnamed: 0,blog_id,userId,ratings,author_id,blog_title,author_name
0,9025,11,3.5,5960,How I became a Frontend Developer,Steven Dornan
1,9025,38,3.5,5960,How I became a Frontend Developer,Steven Dornan
2,9025,253,5.0,5960,How I became a Frontend Developer,Steven Dornan
3,9025,385,0.5,5960,How I became a Frontend Developer,Steven Dornan
4,9025,394,3.5,5960,How I became a Frontend Developer,Steven Dornan
...,...,...,...,...,...,...
200135,6741,4299,5.0,4659,Data Science Interview Questions,Aman Kharwal
200136,4459,3783,5.0,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200137,4459,4541,0.5,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200138,1353,4415,2.0,1094,Can ChatGPT turn Negative Language into Positive Change?,Eduard Ruzga


This gives a DataFrame with all the data needed to create and train the model, and a little more. However, a problem shows up, visible in cell 07.

In [7]:
blogs['ratings'].isnull().sum(), blogs['ratings'].isna().sum(), (blogs['ratings'] == 0).sum()

(0, 0, 0)

This shows that every rating is filled in with a value other than NaN (Not a Number), Null, or 0, which means there are no values left to predict when making a recommendation. One reason is that values such as 0.5 exist, which indicate that the user read the blog. That is not useful rating information for making predictions; only the values above it are useful, since they show how much a user liked a blog. So, to obtain the fields to be predicted, these values are set to 0, which from now on indicates that a user did not rate a given blog. This can be seen in cell 08.

In [8]:
blogs['ratings'] = blogs['ratings'].replace(0.5, 0)
blogs

Unnamed: 0,blog_id,userId,ratings,author_id,blog_title,author_name
0,9025,11,3.5,5960,How I became a Frontend Developer,Steven Dornan
1,9025,38,3.5,5960,How I became a Frontend Developer,Steven Dornan
2,9025,253,5.0,5960,How I became a Frontend Developer,Steven Dornan
3,9025,385,0.0,5960,How I became a Frontend Developer,Steven Dornan
4,9025,394,3.5,5960,How I became a Frontend Developer,Steven Dornan
...,...,...,...,...,...,...
200135,6741,4299,5.0,4659,Data Science Interview Questions,Aman Kharwal
200136,4459,3783,5.0,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200137,4459,4541,0.0,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200138,1353,4415,2.0,1094,Can ChatGPT turn Negative Language into Positive Change?,Eduard Ruzga


The data now has values that can be predicted. Before doing so, it is worth checking whether there are duplicate entries for the same pair of user and blog IDs. This can be seen in cell 09.

In [9]:
duplicates = blogs.duplicated(subset = ['userId', 'blog_id'], keep = False)
blogs[duplicates]

Unnamed: 0,blog_id,userId,ratings,author_id,blog_title,author_name
8235,6295,13,3.5,63,How to Run Your Own LLaMA,"Dr. Mandar Karhade, MD. PhD."
8262,6295,13,2.0,63,How to Run Your Own LLaMA,"Dr. Mandar Karhade, MD. PhD."
57151,6680,22,0.0,74,Full Analysis with Interactive Dashboard,Amit Kumar
57152,6680,22,5.0,74,Full Analysis with Interactive Dashboard,Amit Kumar
67280,6714,22,0.0,4623,Challenging Assumptions: Preparing for Unwanted Data Results,Humberto Rendon
67281,6714,22,5.0,4623,Challenging Assumptions: Preparing for Unwanted Data Results,Humberto Rendon
74131,6576,22,0.0,1438,“Designing a Data Model for a Library Management System using PostgreSQL.”,Vishal Barvaliya
74132,6576,22,3.5,1438,“Designing a Data Model for a Library Management System using PostgreSQL.”,Vishal Barvaliya
74188,6504,22,0.0,1438,10 Ways to Improve Your SQL Queries: Tips and Techniques,Vishal Barvaliya
74189,6504,22,5.0,1438,10 Ways to Improve Your SQL Queries: Tips and Techniques,Vishal Barvaliya


Duplicates do exist. Looking at the ratings, in most pairs one row says the user did not rate the blog and the other holds the user's rating for that same blog. Since rating a blog implies having read it, the duplicates whose rating is 0 can be removed. This can be seen in cell 10.

In [10]:
mask = (blogs.duplicated(subset = ['userId', 'blog_id'], keep = False)) & (blogs['ratings'] == 0.0)
blogs = blogs[~mask]

blogs

Unnamed: 0,blog_id,userId,ratings,author_id,blog_title,author_name
0,9025,11,3.5,5960,How I became a Frontend Developer,Steven Dornan
1,9025,38,3.5,5960,How I became a Frontend Developer,Steven Dornan
2,9025,253,5.0,5960,How I became a Frontend Developer,Steven Dornan
3,9025,385,0.0,5960,How I became a Frontend Developer,Steven Dornan
4,9025,394,3.5,5960,How I became a Frontend Developer,Steven Dornan
...,...,...,...,...,...,...
200135,6741,4299,5.0,4659,Data Science Interview Questions,Aman Kharwal
200136,4459,3783,5.0,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200137,4459,4541,0.0,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200138,1353,4415,2.0,1094,Can ChatGPT turn Negative Language into Positive Change?,Eduard Ruzga


One duplicated pair, seen earlier, still remains: it has one rating of 2 and another of 3.5. In this case the higher rating is kept, as can be seen in cell 11.

In [11]:
index = blogs[(blogs['userId'] == 13) & (blogs['blog_id'] == 6295) & (blogs['ratings'] == 2.0)].index
blogs = blogs.drop(index)

blogs

Unnamed: 0,blog_id,userId,ratings,author_id,blog_title,author_name
0,9025,11,3.5,5960,How I became a Frontend Developer,Steven Dornan
1,9025,38,3.5,5960,How I became a Frontend Developer,Steven Dornan
2,9025,253,5.0,5960,How I became a Frontend Developer,Steven Dornan
3,9025,385,0.0,5960,How I became a Frontend Developer,Steven Dornan
4,9025,394,3.5,5960,How I became a Frontend Developer,Steven Dornan
...,...,...,...,...,...,...
200135,6741,4299,5.0,4659,Data Science Interview Questions,Aman Kharwal
200136,4459,3783,5.0,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200137,4459,4541,0.0,3256,Catching Up with KaratDAO (Jan/Feb Updates),KaratDAO
200138,1353,4415,2.0,1094,Can ChatGPT turn Negative Language into Positive Change?,Eduard Ruzga
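
Cells 10 and 11 together amount to keeping the highest rating for each (user, blog) pair. The following is a minimal plain-Python sketch of that rule, not the notebook's code; the function name `keep_highest_rating` is hypothetical, and the sample rows are four of the duplicated rows shown in the output of cell 09.

```python
def keep_highest_rating(rows):
    """Return one (user, blog, rating) row per (user, blog) pair, keeping the highest rating."""
    best = {}
    for user, blog, rating in rows:
        key = (user, blog)
        # Keep the rating only if it is the first for this pair or beats the stored one
        if key not in best or rating > best[key]:
            best[key] = rating
    return [(user, blog, rating) for (user, blog), rating in best.items()]


# Rows taken from the duplicated pairs listed in cell 09: (user, blog, rating)
rows = [(13, 6295, 3.5), (13, 6295, 2.0), (22, 6680, 0.0), (22, 6680, 5.0)]
deduplicated = keep_highest_rating(rows)
print(deduplicated)
```

Expressed this way, the rule does not depend on knowing in advance which user and blog IDs are duplicated.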


Purely for better organization, it is worth adjusting the column names and their order. This can be seen in cell 12.

In [12]:
blogs = blogs[['userId', 'blog_id', 'ratings', 'blog_title', 'author_id', 'author_name']]
blogs = blogs.rename(columns = {'userId': 'user', 'blog_id':'blog', 'ratings':'rating', 'blog_title':'title', 'author_id':'author', 'author_name':'name'})
blogs

Unnamed: 0,user,blog,rating,title,author,name
0,11,9025,3.5,How I became a Frontend Developer,5960,Steven Dornan
1,38,9025,3.5,How I became a Frontend Developer,5960,Steven Dornan
2,253,9025,5.0,How I became a Frontend Developer,5960,Steven Dornan
3,385,9025,0.0,How I became a Frontend Developer,5960,Steven Dornan
4,394,9025,3.5,How I became a Frontend Developer,5960,Steven Dornan
...,...,...,...,...,...,...
200135,4299,6741,5.0,Data Science Interview Questions,4659,Aman Kharwal
200136,3783,4459,5.0,Catching Up with KaratDAO (Jan/Feb Updates),3256,KaratDAO
200137,4541,4459,0.0,Catching Up with KaratDAO (Jan/Feb Updates),3256,KaratDAO
200138,4415,1353,2.0,Can ChatGPT turn Negative Language into Positive Change?,1094,Eduard Ruzga


For easier visualization, it is worth representing this DataFrame in another way, as can be seen in cell 13.

In [13]:
tabela = pd.pivot_table(blogs.iloc[:500], columns='user', values='rating', index='title', fill_value=0)
tabela

title \ user,11,12,14,23,29,37,38,53,54,57,...,4901,4906,4907,4908,4940,4963,4966,4971,4983,4994
2 ChatGPT (Free) Chrome Extensions so Useful They Almost Feel Illegal,0.0,3.5,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
51 Far Better Things to Do than Scrolling Through Your Smartphone,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Animated Icons: Bottom Nav in Flutter & Rive,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0
April 1st Recommendation on Alignment,0.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Background Services in Flutter Add-to-App Case,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Calling in the background in a flutter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
"Chat-GPT: 5 Game-Changing Functions in AI, Content, Translation, Classification, and Personalization",0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Conflux Network: A High-Throughput Proof-of-Work Blockchain,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Cryptocurrency Regulations: A Tug of War Between Investors and Bureaucrats,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Diving into HTML and the Tools of the Trade,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In some cases only a few people read a given blog. This happens because some blogs cover very specific content that only some users would be interested in reading.

### Final grouping
Now that the data is organized, it is necessary to define how to interpret it. Combining the dataset's own definition with the changes made above, the ratings are read as follows:
* 0 - The user did not read the blog. (Even if they did read it, they gave no rating, so it is treated as unread)
* 2 - The user liked the blog
* 3 - The user added the blog to favorites
* 5 - The user liked the blog and added it to favorites

Only integers are kept to make things simpler for the model, so 3.5 is truncated to 3. This can be seen in cell 14.

In [14]:
blogs['rating'] = blogs['rating'].astype('int')
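
Casting floats to integers truncates the fractional part rather than rounding, which is why 3.5 appears as 3 in the list above. A small plain-Python illustration, under the assumption that the pandas cast behaves like Python's built-in `int()` for these non-negative values:

```python
# The four rating values present after the earlier clean-up steps
ratings = [0.0, 2.0, 3.5, 5.0]

# int() drops the fractional part (truncation toward zero); it does not round
truncated = [int(r) for r in ratings]

# For comparison, rounding would move 3.5 up to 4 instead
rounded = [round(r) for r in ratings]
print(truncated, rounded)
```

If a different mapping were wanted, the values would have to be rounded or remapped explicitly before the cast.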

## Latent Factors
Latent factors are variables that are not directly observed in the data but can be inferred indirectly from the variables that are observed. The goal is to find these latent factors for both users and blogs and use them to predict which blogs a specific user would or would not like to read.

These factors are first generated randomly and then updated as the model is trained, using the gradient of the loss.
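
As a rough illustration of the idea, the sketch below predicts a rating as the dot product of a user's and a blog's latent factors and then takes one gradient step on the squared error. It is plain Python rather than the fastai implementation; the factor values, the learning rate and the function names are assumptions chosen only for the example.

```python
def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def sgd_step(user_f, blog_f, target, lr):
    """One gradient step on the squared error (prediction - target) ** 2."""
    error = dot(user_f, blog_f) - target
    # The gradient with respect to each user factor is 2 * error * blog factor,
    # and symmetrically for each blog factor
    new_user = [u - lr * 2 * error * b for u, b in zip(user_f, blog_f)]
    new_blog = [b - lr * 2 * error * u for u, b in zip(user_f, blog_f)]
    return new_user, new_blog


# Hypothetical starting factors (in practice they start out random)
user_f = [0.1, -0.2, 0.3]
blog_f = [0.2, 0.1, -0.1]
target = 5.0  # the rating this user gave this blog

loss_before = (dot(user_f, blog_f) - target) ** 2
user_f, blog_f = sgd_step(user_f, blog_f, target, lr=0.01)
loss_after = (dot(user_f, blog_f) - target) ** 2
print(loss_before, loss_after)
```

Repeating such steps over many (user, blog, rating) rows is what gradually turns random factors into useful ones.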

## Bias
Another important thing to consider is bias: some users tend to give more positive ratings and others more negative ones, and likewise some blogs tend to receive more positive ratings and others more negative ones. The model should take these aspects into account.
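
A minimal sketch of how bias terms enter the prediction: the dot product of the latent factors is shifted by one number per user and one per blog. All values below are made up for illustration.

```python
def predict(user_f, blog_f, user_bias, blog_bias):
    """Dot product of latent factors plus per-user and per-blog bias."""
    interaction = sum(u * b for u, b in zip(user_f, blog_f))
    return interaction + user_bias + blog_bias


# Hypothetical factors shared by both examples
user_f = [1.0, 0.5]
blog_f = [2.0, 1.0]

# A generous rater versus a harsh rater with otherwise identical tastes
generous = predict(user_f, blog_f, user_bias=0.5, blog_bias=0.25)
harsh = predict(user_f, blog_f, user_bias=-0.5, blog_bias=0.25)
print(generous, harsh)
```

The two users share the same factors, so the whole difference between their predictions comes from the user bias.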

## Models
To begin, the first step is to use the fastai library to create the DataLoaders, which load the data in a form that can be used to train the model. This first step can be seen in cell 15.

In [15]:
dls = CollabDataLoaders.from_df(blogs, item_name='title', bs=64)
dls.show_batch()

Unnamed: 0,user,title,rating
0,2120,Protecting Sensitive Data in Python Logging: A Guide to Redacting Information,5
1,4965,"Cryptography — Sign payload, Encrypt a plain text password and Decrypt it",2
2,2608,"Blade Labs Launches Suite of White Label Wallet Products for Android, iOS and Unreal Engine to Accelerate Web3 Adoption by Enterprises",5
3,3895,Ocean Protocol Update || 2023,5
4,1276,Understanding Keras RNN API,3
5,3429,Web Stack Weekly — Issue#62,5
6,2652,How to Use Default Parameters in Java Methods,0
7,721,The LazyLoad | Flutter-Firebase,5
8,4191,Level Up Your Skills with This Beginner Prompt Engineering Guide,5
9,4797,How AI is changing customer service for the better!,5


As a first attempt, a model class that takes the bias terms into account is built to see how it performs. This starts in cell 16, where the first step is to get the number of users and blogs.

In [16]:
n_users  = len(dls.classes['user'])
n_blogs = len(dls.classes['title'])

With this information, the class used to create the model can be written. This can be seen in cell 17.

In [17]:
class DotProductBias(Module):
    def __init__(self, n_users, n_blogs, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.blog_factors = Embedding(n_blogs, n_factors)
        self.blog_bias = Embedding(n_blogs, 1)
        self.y_range = y_range
        
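    def forward(self, x):
        # x[:,0] holds the user indices and x[:,1] holds the blog (title) indices
        users = self.user_factors(x[:,0])
        blogs = self.blog_factors(x[:,1])
        # Dot product of the latent factors, one value per row
        res = (users * blogs).sum(dim=1, keepdim=True)
        # Add the per-user and per-blog bias terms
        res += self.user_bias(x[:,0]) + self.blog_bias(x[:,1])
        # Squash the output into y_range
        return sigmoid_range(res, *self.y_range)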
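
The `sigmoid_range` call at the end of `forward` squashes the raw score into `y_range`. The plain-Python sketch below mirrors fastai's definition, `sigmoid(x) * (high - low) + low`, for a single number; it is an illustration, not the library code, which operates on tensors.

```python
import math


def sigmoid_range(x, low, high):
    """Map a real number x into the interval between low and high."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low


# With y_range=(0, 5.5), a raw score of 0 lands in the middle of the range
middle = sigmoid_range(0.0, 0, 5.5)
# Large positive scores approach the upper bound
near_top = sigmoid_range(10.0, 0, 5.5)
print(middle, near_top)
```

Because the sigmoid only approaches its limits, setting the upper bound to 5.5 rather than 5 leaves the maximum rating of 5 comfortably reachable.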

With the class written, the model can be created and trained. This can be seen in cell 18.

In [18]:
model = DotProductBias(n_users, n_blogs, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 11e-3)

epoch,train_loss,valid_loss,time
0,3.767183,3.882106,00:20
1,3.686265,4.778177,00:20
2,1.543564,5.071566,00:20
3,0.634431,5.019003,00:20
4,0.2357,5.011824,00:20


A problem is visible: the training loss goes down but the validation loss does not. This happens because the model is simply memorizing the training data. Weight decay is needed to address this.

### Weight Decay
The idea is to penalize very large weights so that overfitting does not occur. With this in mind, the model is trained again, as can be seen in cell 19.

In [19]:
model = DotProductBias(n_users, n_blogs, 50)
learn_c = Learner(dls, model, loss_func=MSELossFlat())
learn_c.fit_one_cycle(5, 11e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,3.788451,3.766351,00:20
1,3.781252,3.799533,00:20
2,3.554136,3.775183,00:20
3,3.104404,3.779299,00:20
4,2.285101,3.793869,00:21


The model still overfits, but the result is better than the first model's: the final validation loss is about 3.79 instead of about 5.01.
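
In its textbook form, weight decay adds the sum of squared weights, scaled by `wd`, to the loss, which contributes `2 * wd * w` to each weight's gradient. The sketch below shows that formulation in plain Python with made-up numbers; it is an assumption about the general technique, not a trace of what fastai's optimizer does internally.

```python
def loss_with_weight_decay(base_loss, weights, wd):
    """Base loss plus the L2 penalty wd * sum(w ** 2)."""
    return base_loss + wd * sum(w ** 2 for w in weights)


def grads_with_weight_decay(grads, weights, wd):
    """Add the penalty's gradient, 2 * wd * w, to each base gradient."""
    return [g + 2 * wd * w for g, w in zip(grads, weights)]


# Hypothetical weights and base gradients
weights = [3.0, -2.0, 0.5]
grads = [0.0, 0.0, 0.0]

penalized = loss_with_weight_decay(1.0, weights, wd=0.1)
new_grads = grads_with_weight_decay(grads, weights, wd=0.1)
print(penalized, new_grads)
```

The largest weight receives the largest push back toward zero, which is how the penalty discourages the model from memorizing the training data.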

## Interpreting the data
After testing several models, it is worth looking at the results, starting with which blogs sit at each extreme of the bias, that is, those most prone to negative ratings and those most prone to positive ratings. This can be seen in cells 20 and 21, respectively.

In [20]:
blogs_bias = learn_c.model.blog_bias.weight.squeeze()
idxs = blogs_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]

['Outlier Detection Using Principal Component Analysis and Hotelling’s T2 and SPE/DmodX Methods',
 '6 Takeaways from ETHDenver 2023',
 'Power Up Your Data Visuals with Power BI',
 'A quick guide to deploying your Python webapp on Google App Engine',
 'Ensuring Security and Privacy in NLP Models like ChatGPT and Google BERT']

In [21]:
idxs = blogs_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

['Top Mobile App Development Trends To Look Out For In 2023',
 'Read Image Using OpenCV Framework',
 '03–14–2023 — Lif3 Update',
 'Upload Your App to the Play Store: A Complete Beginner’s Walkthrough',
 'Open-Source Contribution to NLP packages']

This shows which blogs have a bias toward receiving lower ratings and which toward higher ratings.
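
The `argsort` calls above return indices ordered by bias value. The same ranking idea in plain Python, with hypothetical titles and bias values:

```python
# Hypothetical learned bias per blog title
bias_by_title = {
    "Blog A": 0.42,
    "Blog B": -0.31,
    "Blog C": 0.07,
    "Blog D": -0.05,
}

# Ascending order: titles with the lowest bias first
lowest = sorted(bias_by_title, key=bias_by_title.get)[:2]
# Descending order: titles with the highest bias first
highest = sorted(bias_by_title, key=bias_by_title.get, reverse=True)[:2]
print(lowest, highest)
```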

## Deep Learning
To try to improve the model, the next step is a deep learning approach, where the latent factors feed into the layers of a neural network. One of the first steps is to use a fastai function that, based on its heuristic, returns a good number of latent factors for the variables used. This can be found in cell 22.

In [22]:
embs = get_emb_sz(dls)
embs

[(4965, 188), (9706, 273)]
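
The sizes returned above are consistent with fastai's rule of thumb for embedding sizes, `min(600, round(1.6 * n_cat ** 0.56))`. A plain-Python sketch of that rule, reimplemented here for illustration rather than imported from the library:

```python
def emb_sz_rule(n_cat):
    """Rule-of-thumb embedding size for a categorical variable with n_cat levels."""
    return min(600, round(1.6 * n_cat ** 0.56))


# Category counts taken from the output of cell 22
user_size = emb_sz_rule(4965)
blog_size = emb_sz_rule(9706)
print(user_size, blog_size)
```

The size grows with the number of categories and is capped at 600.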

With these values, the class that builds a layered model can be written, as can be seen in cell 23.

In [23]:
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
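
To make the forward pass concrete, the sketch below walks one (user, blog) pair through the same steps in plain Python: concatenate the two embedding vectors, apply a linear layer and ReLU, apply a second linear layer, and squash the result into `y_range`. The tiny sizes and all weights are hypothetical.

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in x]


def sigmoid_range(x, low, high):
    """Map a real number x into the interval between low and high."""
    return 1 / (1 + math.exp(-x)) * (high - low) + low


# Hypothetical embeddings: 2 user factors and 3 blog factors
user_emb = [0.5, -1.0]
blog_emb = [1.0, 0.0, 2.0]

# Concatenation replaces the dot product used by DotProductBias
x = user_emb + blog_emb

# Hidden layer with 2 activations, then an output layer with 1
hidden_weights = [[1.0, 0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, -1.0]]
hidden = relu(linear(x, hidden_weights, [0.0, 0.0]))
raw = linear(hidden, [[1.0, 1.0]], [-1.5])[0]
prediction = sigmoid_range(raw, 0, 5.5)
print(hidden, raw, prediction)
```

Since the embeddings are concatenated rather than multiplied element by element, the user and blog vectors do not need to have the same length.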

With all this in place, the model can be created and trained. This can be seen in cell 24.

In [24]:
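model = CollabNN(*embs)  # Note: this instance is not passed to the learner below

# collab_learner with use_nn=True builds its own neural network model from dls
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
learn.fit_one_cycle(5, 11e-3, wd=0.1)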

epoch,train_loss,valid_loss,time
0,3.625021,3.682752,00:27
1,3.689891,3.676275,00:25
2,3.641419,3.679555,00:25
3,3.652823,3.684319,00:25
4,3.538341,3.855625,00:25


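Although it also suffered from overfitting, it still did better than the others for most of training: its validation loss stayed around 3.68 in the first four epochs, below the 3.77 to 3.80 of the weight-decay model, before rising to about 3.86 in the last epoch.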

## Exporting
To export, both the model and the validation data need to be saved, since the predictions and recommendations will be based on them. This can be seen in cells 25 and 26.

In [25]:
# Get the validation data
valid = dls.valid_ds
export = pd.DataFrame(valid.items)

# Export it to CSV
export.to_csv('valid.csv', index=False)

In [26]:
learn.export('export.pkl')
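
Once predictions are available, turning them into recommendations is a ranking step. The sketch below is a hypothetical, plain-Python illustration of that step and is not the code behind the deployed model: it ranks blogs by predicted rating and skips the ones the user has already rated.

```python
def recommend(predicted, already_rated, k):
    """Return the k titles with the highest predicted rating, skipping rated ones."""
    candidates = {t: p for t, p in predicted.items() if t not in already_rated}
    return sorted(candidates, key=candidates.get, reverse=True)[:k]


# Hypothetical predicted ratings for one user
predicted = {"Blog A": 4.8, "Blog B": 2.1, "Blog C": 3.9, "Blog D": 4.5}
already_rated = {"Blog A"}

top = recommend(predicted, already_rated, k=2)
print(top)
```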

The working model can be found at [Blogs](https://huggingface.co/spaces/fastaioncampus/Blogs).