<a href="https://colab.research.google.com/github/flaviowu/btc-c14-g4/blob/main/notebooks/test_mod6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Overview
Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

Surprise was designed with the following purposes in mind:

Give users perfect control over their experiments. To this end, a strong emphasis is placed on documentation, which we have tried to make as clear and precise as possible by pointing out every detail of the algorithms.
Alleviate the pain of dataset handling. Users can use both built-in datasets (Movielens, Jester) and their own custom datasets.
Provide various ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, matrix factorization-based methods (SVD, PMF, SVD++, NMF), and many others. Various similarity measures (cosine, MSD, pearson…) are also built in.
Make it easy to implement new algorithm ideas.
Provide tools to evaluate, analyze and compare the performance of the algorithms. Cross-validation procedures can be run very easily using powerful CV iterators (inspired by scikit-learn's excellent tools), as well as an exhaustive search over a set of parameters.
The name SurPRISE (roughly :)) stands for Simple Python RecommendatIon System Engine.

Note that Surprise does not support implicit ratings or content-based information.

https://surpriselib.com/

In [None]:
!pip install surprise

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import random

from surprise import SVD
from surprise import NMF
from surprise.model_selection import cross_validate
from surprise import Reader, Dataset

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 
from sklearn.model_selection import train_test_split



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 4.1 MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633984 sha256=b4e865bae89029550c6a381fb5e823db695a5bc1def5bbd4a0eaf4a054bba903
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [None]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/modulo 6/train.tsv', sep = '\t')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/modulo 6/test.tsv', sep = '\t')

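def data(n, seed):
    """Return n pseudo-random dates in 2018, formatted as 'day-month-2018'."""
    datas = []
    random.seed(seed)
    for i in range(n):
        dia_maximo = 28
        dia_minimo = 1
        mes_maximo = 13
        mes_minimo = 1

        # int() truncates, so dia falls in 1..27 and mes in 1..12
        dia = int(random.random() * (dia_maximo - dia_minimo) + dia_minimo)
        mes = int(random.random() * (mes_maximo - mes_minimo) + mes_minimo)

        datas.append(str(dia)+'-'+str(mes)+'-2018')

    return datas

def estoque(n, seed):
    """Return n synthetic stock levels as integers greater than or equal to 1."""
    np.random.seed(seed)
    mu, sigma = 1, 20
    s = np.random.normal(mu, sigma, n)
    s[s < 0] = s[s < 0] * -0.5  # flip negative draws to positive and halve them
    s = s.astype(int)
    s[s < 1] = 1                # stock is never below 1

    return s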

train['date']  = data(n = train.shape[0], seed = 10)
train['stock'] = estoque(n = train.shape[0], seed = 10)

test['date']  = data(n = test.shape[0], seed = 15)
test['stock'] = estoque(n = test.shape[0], seed = 15)
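The date helper above truncates a scaled uniform draw. As a minimal stdlib-only sketch (the function name `draw_int` is hypothetical and not part of the notebook), the following shows which integers that expression can actually produce:

```python
import random

def draw_int(low, high, rng):
    """Mimic int(random.random() * (high - low) + low) from the notebook."""
    return int(rng.random() * (high - low) + low)

rng = random.Random(10)  # local generator, so the global seed is untouched
days = {draw_int(1, 28, rng) for _ in range(10_000)}
months = {draw_int(1, 13, rng) for _ in range(10_000)}

# random() lies in [0, 1), so the upper bound itself is never reached
print(min(days), max(days))      # bounds of the generated days
print(min(months), max(months))  # bounds of the generated months
```

If day 28 should be possible, `random.randint(1, 28)` includes both endpoints.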

In [None]:
train = train.drop(columns=['brand_name','date','stock'])  # drop the brand, date and stock columns; earlier analysis indicated they are not important

In [None]:
train.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,52.0,0,This keyboard is in great condition and works ...
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,10.0,1,Adorable top with a hint of lace and a key hol...
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,35.0,1,New with tags. Leather horses. Retail for [rm]...
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,44.0,0,Complete with certificate of authenticity


In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 7 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   train_id           1482535 non-null  int64  
 1   name               1482535 non-null  object 
 2   item_condition_id  1482535 non-null  int64  
 3   category_name      1476208 non-null  object 
 4   price              1482535 non-null  float64
 5   shipping           1482535 non-null  int64  
 6   item_description   1482531 non-null  object 
dtypes: float64(1), int64(3), object(3)
memory usage: 79.2+ MB


In [None]:
# isnull() and isna() are aliases, so both calls print the same boolean mask
print(train.isnull())
print(train.isna())

         train_id   name  item_condition_id  category_name  price  shipping  \
0           False  False              False          False  False     False   
1           False  False              False          False  False     False   
2           False  False              False          False  False     False   
3           False  False              False          False  False     False   
4           False  False              False          False  False     False   
...           ...    ...                ...            ...    ...       ...   
1482530     False  False              False          False  False     False   
1482531     False  False              False          False  False     False   
1482532     False  False              False          False  False     False   
1482533     False  False              False          False  False     False   
1482534     False  False              False          False  False     False   

         item_description  
0                   Fal

In [None]:
train=train.dropna()
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1476204 entries, 0 to 1482534
Data columns (total 7 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   train_id           1476204 non-null  int64  
 1   name               1476204 non-null  object 
 2   item_condition_id  1476204 non-null  int64  
 3   category_name      1476204 non-null  object 
 4   price              1476204 non-null  float64
 5   shipping           1476204 non-null  int64  
 6   item_description   1476204 non-null  object 
dtypes: float64(1), int64(3), object(3)
memory usage: 90.1+ MB


In [None]:
train['price'].max()

2009.0

In [None]:
random_train = train.sample(frac=0.01, random_state=99)  # must be very low; larger fractions exhaust the RAM and crash

In [None]:
random_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14762 entries, 533258 to 351418
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   train_id           14762 non-null  int64  
 1   name               14762 non-null  object 
 2   item_condition_id  14762 non-null  int64  
 3   category_name      14762 non-null  object 
 4   price              14762 non-null  float64
 5   shipping           14762 non-null  int64  
 6   item_description   14762 non-null  object 
dtypes: float64(1), int64(3), object(3)
memory usage: 922.6+ KB


In [None]:
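reader = Reader(rating_scale=(0, 2009))  # scale set to reach the max price

# load_from_df reads the columns in the order (user, item, rating), so here
# train_id is the user, price is the item and item_condition_id is the rating
data = Dataset.load_from_df(random_train[['train_id','price','item_condition_id']], reader)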

algo = SVD()
cross_validate(algo, data, measures=['RMSE','MAE'], cv=4, verbose=True)



Evaluating RMSE, MAE of algorithm SVD on 4 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Mean    Std     
RMSE (testset)    0.9084  0.9010  0.8959  0.8923  0.8994  0.0060  
MAE (testset)     0.7901  0.7750  0.7753  0.7773  0.7794  0.0062  
Fit time          0.70    0.57    0.58    0.68    0.63    0.06    
Test time         0.02    0.02    0.02    0.02    0.02    0.00    


{'test_rmse': array([0.90841768, 0.90095664, 0.89585768, 0.89233484]),
 'test_mae': array([0.7901147 , 0.77499951, 0.77529656, 0.77728965]),
 'fit_time': (0.7007725238800049,
  0.5733885765075684,
  0.5755734443664551,
  0.6799583435058594),
 'test_time': (0.0231478214263916,
  0.017055988311767578,
  0.01752018928527832,
  0.017934560775756836)}
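The RMSE and MAE rows above summarise the prediction error per fold, in the units of the rating column passed to `load_from_df`. A minimal sketch of the two formulas, using made-up numbers rather than the notebook's data (the helper names `rmse` and `mae` are illustrative only):

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Made-up (true rating, estimate) pairs, for illustration only
pairs = [(3, 2.5), (1, 1.5), (2, 3.0), (5, 4.0)]
print(round(rmse(pairs), 4), round(mae(pairs), 4))
```

RMSE squares each error before averaging, so it penalises large misses more heavily than MAE does.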

In [None]:
# train_set is used to build both the training set and the test set
train_set = data.build_full_trainset()
test_set = train_set.build_anti_testset()
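An anti-testset contains every (user, item) pair that has no rating in the trainset. The following is a rough pure-Python sketch of that idea, not Surprise's actual implementation; the function `anti_testset` and the toy data are hypothetical:

```python
from itertools import product

def anti_testset(ratings, fill=None):
    """Return (user, item, fill) for every pair missing from ratings.

    ratings is a list of (user, item, rating) triples. When fill is None,
    the mean of the known ratings is used as a placeholder value.
    """
    known = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    if fill is None:
        fill = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, fill) for u, i in product(users, items) if (u, i) not in known]

# Toy data: three users, two items, three observed ratings
ratings = [("u1", "a", 3), ("u2", "a", 1), ("u3", "b", 2)]
pairs = anti_testset(ratings)
print(len(pairs))  # 3 users * 2 items - 3 known pairs
```

The result grows with the number of users times the number of items, which is one reason memory use climbs quickly on larger samples.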


In [None]:
predictions = algo.fit(train_set).test(test_set)

In [None]:
train.loc[:, 'train_id'] = train.loc[:, 'name'].astype('category').cat.codes

train_price, test_price = train_test_split(train, train_size=0.003)  # anything above 0.003 will crash

train_price.reset_index(drop=True, inplace=True)


train_price.head()


Unnamed: 0,train_id,name,item_condition_id,category_name,price,shipping,item_description
0,355304,FOR WARREN ALISHA,3,"Men/Tops/Polo, Rugby",8.0,1,No description yet
1,458727,Harley Davidson Tee-Shirt,2,Men/Tops/T-shirts,14.0,0,Men's L Nice graphic shirt with quality cotton...
2,231835,Bumgenius 4.0 chaplain cloth diaper,1,Kids/Diapering/Cloth Diapers,23.0,0,This is a bumgenius 4.0 pocket diaper in Chapl...
3,640778,MacBook Cover,2,Electronics/Computers & Tablets/Laptops & Netb...,9.0,0,Case cover fits MacBook Pro Model No A1278. In...
4,855586,Pink portable charging bank,2,Electronics/Cell Phones & Accessories/Chargers...,5.0,1,No description yet


In [None]:
train_price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4428 entries, 0 to 4427
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   train_id           4428 non-null   int32  
 1   name               4428 non-null   object 
 2   item_condition_id  4428 non-null   int64  
 3   category_name      4428 non-null   object 
 4   price              4428 non-null   float64
 5   shipping           4428 non-null   int64  
 6   item_description   4428 non-null   object 
dtypes: float64(1), int32(1), int64(2), object(3)
memory usage: 225.0+ KB


In [None]:
# the recommender relies on the TF-IDF vectors of each item's category_name

print(f"{len(train_price)} samples were used for training.")
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

tfidf_matrix = tf.fit_transform(train_price['category_name'])

4428 samples were used for training.
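To make the similarity step concrete, here is a simplified stdlib-only sketch of cosine similarity between category strings. It assumes raw word counts instead of TF-IDF weights and ignores n-grams and stop words, so the scores differ from what `TfidfVectorizer` would give; the helpers `tokens` and `cosine` are hypothetical:

```python
import math
from collections import Counter

def tokens(category):
    """Lower-case a category path and split it into word tokens."""
    return category.lower().replace("/", " ").replace(",", " ").split()

def cosine(a, b):
    """Cosine similarity between the token-count vectors of two strings."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

# Category strings taken from the train_price.head() output
polo = "Men/Tops/Polo, Rugby"
tee = "Men/Tops/T-shirts"
diaper = "Kids/Diapering/Cloth Diapers"
print(cosine(polo, tee), cosine(polo, diaper))
```

Categories that share path components score above zero, while unrelated categories score exactly zero.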


In [None]:
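# TF-IDF rows are L2-normalised by default, so the linear kernel equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

results = {}
for idx, row in train_price.iterrows():
    # keep the 99 highest-scoring rows, most similar first
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], train_price['train_id'][i]) for i in similar_indices]
    results[row['train_id']] = similar_items[1:]  # skip the first hit, expected to be the item itself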
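How many neighbours survive depends on the stop index of the reversed slice applied to the `argsort` result. A small sketch with a plain list (a stand-in for the sorted indices, not the notebook's data) contrasts a negative stop with a positive one:

```python
# Stand-in for an argsort result: positions ordered from least to most similar
order = list(range(10))

top3 = order[:-4:-1]  # walk backwards from the end, stop before index -4
rest = order[:4:-1]   # walk backwards from the end, stop before index 4

print(top3)  # [9, 8, 7]
print(rest)  # [9, 8, 7, 6, 5]
```

With a negative stop `-k` the slice keeps the `k - 1` largest entries; with a positive stop `k` it keeps everything except the first `k + 1` positions.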




In [None]:
def item(id):
    return train_price.loc[train_price['train_id'] == id]['name'].tolist()[0].split(' - ')[0]

def rec(item_id, num):
    print('Recommending ' + str(num) + ' products similar to ' + item(item_id) + '.')
    print('-----')
    recs = results[item_id][:num]
    for score, rec_id in recs:
        print('Recommendation: ' + item(rec_id))



In [None]:
itemId_ = train_price.loc[:, 'train_id'].values[0]  # first row of train_price
itemName_ = train_price.loc[train_price['train_id'] == itemId_, 'name'].values[0]
itemPrice_ = train_price.loc[train_price['train_id'] == itemId_, 'price'].values[0]

print(f"Item {itemId_}, named {itemName_}, priced at $ {itemPrice_}\n")



rec(item_id=itemId_, num=5)

Item 355304, named FOR WARREN ALISHA, priced at $ 8.0

Recommending 5 products similar to FOR WARREN ALISHA.
-----
Recommendation: Lacoste Polo, size 4 (medium)
Recommendation: 2 x burberry polo shirt for jramirez
Recommendation: Express men's polo bundle
Recommendation: Polo Ralph Lauren Collared Shirt
Recommendation: Banana Republic Men's Medium Polo
